Abstract

I propose a general algorithm for detecting translational equivalence between text samples
in different languages. This algorithm is based on current approaches to duplicate detection,
and it relies on information which can be automatically learned from parallel text. I also
show experimental results which support the hypothesis that translational equivalence is
empirically observable. In addition, these results suggest profitable directions for improving
performance on this recognition task.
1  Introduction


While the task of automatic translation has received a great deal of attention
in current literature (e.g., [WS99]), the complementary task of recognizing translations is
largely unexplored. One may ask the question, “given two linguistic samples E and F,
does the predicate Translation(E, F) hold true?” More generally, one might produce a
confidence score for the translational equivalence of E and F.

One candidate for such a score is Pr(E, F) for some statistical generative model of
translation. Melamed [Mel00] describes three symmetric word-to-word models which are
learnable from text in parallel translation (bi-texts). However, given E and F, the
computation of the true value of Pr(E, F) is expensive because it requires a summation over
all possible word-to-word assignments. In order to avoid the full cost of this computation,
Melamed utilizes the maximum a posteriori (MAP) approximation in his model estimation
methods. However, an additional problem presents itself for this particular application of
such models: the value of Pr(E, F) diminishes exponentially as the lengths of E and F
increase. This is problematic for two reasons: one might wish to find translation pairs in
sets where the strings are either very long (e.g., documents) or highly variable in length.

Consider an example in which we wish to choose the best French translation for the
English sentence “John buys shirts in Paris” (E). In this simple case, suppose we have two
options: “Jean mange” ($F_1$) and “Jean achète souvent des chemises à Paris” ($F_2$). Note the
lengths of these strings:

\[ |E| = 5, \qquad |F_1| = 2, \qquad |F_2| = 7 \]

In the Melamed models, a single word may have either one or zero corresponding words
in a generated translation pair. Words are generated in pairs, one from each language, in
which one word or the other in a pair may be null, but not both [Mel00]. This means that,
if E and $F_1$ were generated as a pair, there were between 5 and 7 word pairs generated;
if E and $F_2$ were generated as a pair, there were between 7 and 12 word pairs generated.
Clearly, by such a model, $\Pr(E, F_1)$ stands a good chance of being significantly greater
than $\Pr(E, F_2)$, even though $F_2$ is the better translation. If that probability distribution
is the scoring function for translational equivalence, performance can be expected to be poor.
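The length effect can be made concrete with a small sketch. It assumes, purely for illustration, that every generated word pair contributes the same probability factor `p`; this constant is hypothetical and is not a parameter of the Melamed models. Only the bounds on the number of word pairs come from the description above.

```python
def pair_count_bounds(len_e, len_f):
    """Bounds on the number of generated word pairs when each pair holds
    one word per language and either side (but not both) may be null."""
    fewest = max(len_e, len_f)   # every word of the shorter string is linked
    most = len_e + len_f         # every word is paired with null
    return fewest, most


def best_case_score(len_e, len_f, p=0.1):
    """Toy upper bound on the joint score, assuming (hypothetically) that
    each generated pair contributes the same factor p."""
    fewest, _ = pair_count_bounds(len_e, len_f)
    return p ** fewest


bounds_f1 = pair_count_bounds(5, 2)   # E against F_1
bounds_f2 = pair_count_bounds(5, 7)   # E against F_2
score_f1 = best_case_score(5, 2)
score_f2 = best_case_score(5, 7)
```

Under this toy assumption the short, wrong candidate always receives the higher score, simply because fewer factors are multiplied together.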

This work seeks to build on the idea of a symmetric translation model as a useful tool
in recognizing translational equivalence in text while avoiding the sentence-length problem
and keeping computational feasibility in mind.

Consideration of the translation detection task raises the question of what it means for
two text-strings to be mutual translations. It has been argued [Who73] that translation
between languages is not possible. This investigation seeks to show that empirically
measurable properties of strings suffice to learn automatic classifiers (or, more generally,
scoring functions) to support the hypothesis that “translation-ness” is an observable
property of some bi-texts. Following an intuition made explicit by Alshawi et al. [ABD00],
I suggest that a profitable approach may be to avoid artificial meaning representations in
favor of the most natural ones: the strings themselves.
1.1  Potential  Applications

I suggest four practical applications of such a scoring function. The first is in parallel
corpus construction using systems like STRAND [Res99] and Nie et al.’s system [NSID99].
STRAND is a tool which automatically discovers World-Wide Web pages which may
be mutual translations (in a given language pair), then filters the candidates based on
structural similarity evidenced by the language-independent markup tags present in the
documents. While the precision of STRAND is extremely high, experiments show that the
filter is overly strict, giving a yield with room for improvement.

STRAND carries out two types of search, both of which could benefit from a translation
scoring function. In the first, document-pairs are classified based on their markup tags. A
correlation score determines the likelihood that the two documents are parallel text, but in
some cases one document in an actual translation pair will have extra text. This results in
the entire document pair being discarded. If the text chunks (between markup tags) could
be evaluated on their own, portions of such asymmetric documents might be salvaged to
increase the yield of STRAND without affecting precision.

The other search task STRAND addresses is pairing up items from two lists of candidate
pages. Given two sets of documents, STRAND attempts to generate an assignment^2
between them, but because of the high computational cost of comparing the contents of each
possible document pair, STRAND produces candidates based solely on the URL strings.
(Nie et al. [NSID99] and Chen and Nie [CN00] used a similar approach.) The markup filter
is then applied, but if the candidates are wrongly paired, translation pairs may be lost
simply because they weren’t paired based on URL string similarity.

A second application considers text-strings of shorter length; computing a maximum a
posteriori (MAP) word-to-word assignment where some syntactic information is available
is a task faced, e.g., in translation modeling. If, for example, NP-bracketing is available
for both sides of a bi-text, determining which contiguous chunks (NPs or extra-NP chunks)
correspond with each other using a general classifier would help to guide the search. This
application could be part of a framework involving bootstrapping of more complex translation
models from simpler ones (e.g., [MS00], [BB94]).

A  third  application  involves  comparable  corpora.  A  comparable  corpus  contains  text 
in  multiple  languages  that  is  known  to  contain  some  similar  content,  such  as  news  from 
the  same  time  period.  While  comparable  corpora  do  not  contain  direct  translations,  a 
comparable  corpus  might  be  assumed  to  contain  some  translationally-equivalent  material. 
Extracting this material might be a technique for parallel corpus construction. More
generally, ranking potentially translationally equivalent portions of the corpus could provide a
means  to  weight  examples  for  some  other  learning  tasks. 

Finally, in multilingual information retrieval, one would prefer to avoid returning
translationally-equivalent duplicates. If two documents can be classified as duplicates (i.e.,
translations of each other), there exists a potential for improved recall in N-best systems. At
the same time, translation detection could allow a cost savings when the translation of a
document is desired; if the translation exists in a database and can be identified, the task
of translating the document is not necessary [Oard01].

^2 I use the term “assignment” to refer to chunk-to-chunk mappings which respect no ordering
restrictions, and I use the term “alignment” to refer to such mappings which do not allow “cross-over.”
1.2  Background:  Duplicate  Detection


Broder  et  al.  [BGMZ97]  sought  to  detect  copies  of  documents  in  a  single  language.  They 
propose  two  document  similarity  scores,  resemblance  (r)  and  containment  (c). 

In order to compute these scores, a document D is viewed as a set of shingles [Dam95]
(a “shingling”), where a shingle is an n-gram type (i.e., a contiguous subsequence of length
n) contained in D. For example, the trigram shingling of the next sentence is {“Denote
the shingling”, “the shingling of”, “shingling of D”, “of D as”, “D as S(D)”}. Denote
the shingling of D as S(D). Resemblance is a measure in [0, 1]; a higher score indicates a
higher degree of similarity between two documents. It is defined for documents $D_1$ and $D_2$
as follows:


\[ r(D_1, D_2) = \frac{|S(D_1) \cap S(D_2)|}{|S(D_1) \cup S(D_2)|} \tag{1} \]


Containment is also in [0, 1]. It indicates a level of confidence that $D_1$ is contained within
$D_2$. It is defined as:

\[ c(D_1, D_2) = \frac{|S(D_1) \cap S(D_2)|}{|S(D_1)|} \tag{2} \]
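The two scores can be sketched in a few lines. This is a minimal illustration of equations (1) and (2), assuming whitespace tokenization and word-level shingles; the function names and the sample strings are mine, not from [BGMZ97].

```python
def shingles(text, n=3):
    """Return the shingling of a text: its set of word n-gram types."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def resemblance(d1, d2, n=3):
    """Equation (1): shared shingles over all shingles of either document."""
    s1, s2 = shingles(d1, n), shingles(d2, n)
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0


def containment(d1, d2, n=3):
    """Equation (2): fraction of d1's shingles that also occur in d2."""
    s1, s2 = shingles(d1, n), shingles(d2, n)
    return len(s1 & s2) / len(s1) if s1 else 0.0


a = "the cat sat on the mat"
b = "the cat sat on the mat today"
r_ab = resemblance(a, b)    # 4 shared trigrams out of 5 in total
c_ab = containment(a, b)    # every trigram of a occurs in b
c_ba = containment(b, a)    # one trigram of b is missing from a
```

Containment is asymmetric, which is what makes it suitable for asking whether one document is embedded in another.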


Broder et al. used a sampling technique for estimating the shingling of documents; for
present purposes I shall not address this issue, since most of my discussion is directed toward
smaller text segments for which sampling is unnecessary.

Another approach to similarity is given by Lin [Lin98]. Lin gives a theoretically-motivated
general description of similarity:

\[ \mathrm{sim}(D_1, D_2) = \frac{\log \Pr[\mathrm{common}(D_1, D_2)]}{\log \Pr[\mathrm{description}(D_1, D_2)]} \tag{3} \]


That is, the similarity between $D_1$ and $D_2$ is measured by the ratio between the amount of
information needed to state the commonality of $D_1$ and $D_2$ and the information needed to
fully describe what $D_1$ and $D_2$ are.

This definition is in fact quite similar (no pun intended) to the one offered by [BGMZ97].
If we view $D_1$ as $S(D_1)$ and $D_2$ as $S(D_2)$, then the Lin similarity measure is given as:

\[ \mathrm{sim}(D_1, D_2) = \frac{2 \times \sum_{s \in S(D_1) \cap S(D_2)} \log \Pr(s)}{\sum_{s \in S(D_1)} \log \Pr(s) + \sum_{s \in S(D_2)} \log \Pr(s)} \tag{4} \]


For  purposes  of  comparison  with  the  [BGMZ97]  r  score,  note  that,  trivially: 


\[ |S(D_1) \cap S(D_2)| = \sum_{x \in S(D_1) \cap S(D_2)} 1 \tag{5} \]

\[ |S(D_1) \cup S(D_2)| = \sum_{x \in S(D_1) \cup S(D_2)} 1 \tag{6} \]

While  the  two  similarity  measures  are  by  no  means  mathematically  equivalent,  one  notes 
that: 

•  The domain of the summed items is {0, 1} for [BGMZ97] and $(-\infty, 0]$ (continuous)
for [Lin98].

•  The denominator values, while not identical, are related. (Note that
$|S(D_1) \cup S(D_2)| = |S(D_1)| + |S(D_2)| - |S(D_1) \cap S(D_2)|$.)

•  The factor of 2 in [Lin98] is irrelevant in a competitive scoring framework.

The key difference between the two approaches, for practical purposes, is that sim
assumes a probability model, while r assumes discrete sets.
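The practical consequence of that difference can be seen in a small sketch of equation (4) next to equation (1). The unigram probabilities below are invented for illustration only; any real use would estimate them from a corpus.

```python
import math


def lin_similarity(s1, s2, prob):
    """Equation (4): information carried by the shared shingles, relative
    to the information carried by both shingle sets."""
    common = sum(math.log(prob[s]) for s in s1 & s2)
    total = (sum(math.log(prob[s]) for s in s1)
             + sum(math.log(prob[s]) for s in s2))
    return 2 * common / total if total else 0.0


def resemblance(s1, s2):
    """Equation (1) over ready-made shingle sets."""
    return len(s1 & s2) / len(s1 | s2)


# Hypothetical unigram model: "the" is frequent, the nouns are rare.
prob = {"the": 0.5, "cat": 0.01, "dog": 0.01}

share_common = ({"the", "cat"}, {"the", "dog"})   # overlap is a frequent word
share_rare = ({"the", "cat"}, {"cat", "dog"})     # overlap is a rare word

r_common = resemblance(*share_common)
r_rare = resemblance(*share_rare)
sim_common = lin_similarity(*share_common, prob)
sim_rare = lin_similarity(*share_rare, prob)
```

Both pairs have the same resemblance, since one shingle in three is shared, but sim rewards the pair whose shared shingle is less probable and therefore more informative.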

2  Translational  Equivalence  as  a  Function  Over  Sets 

The sim measure requires a probability distribution over shingle (n-gram) types s. When
considering documents in a single language, it is straightforward to estimate the parameters
of an n-gram model. However, it becomes less clear how to define such a probability model
over shingles when dealing with multilingual documents; yet this is required for computing
the intersection of shingle sets in different languages. The concept of shingle equality must
first be addressed.

Let us suppose that there is a set of language-independent concepts $C$ which is universal.
In the text production process, these concepts are lexicalized and ordered according
to language-specific parameters. Let each concept $c$ have a set $\lambda_{\mathcal{L}}(c)$ of lexical items in
language $\mathcal{L}$ which are candidates in the lexicalization process.

We may consider two elements $e$ and $f$ in two languages $\mathcal{L}_1$ and $\mathcal{L}_2$, respectively, to be
translationally equivalent if and only if there exists a concept $c$ such that
$e \in \lambda_{\mathcal{L}_1}(c)$ and $f \in \lambda_{\mathcal{L}_2}(c)$.^3

Unfortunately, the set of concepts $C$ is not directly observable (and arguably may not
exist). Setting this issue aside, let us suppose that we have some means to estimate a confidence
score for the following statement:

\[ \exists c \in C : e \in \lambda_{\mathcal{L}_1}(c) \wedge f \in \lambda_{\mathcal{L}_2}(c) \tag{7} \]

Let the confidence score be in the domain [0, 1]: 0 indicates an assertion that $e$ and $f$ are
not at all translationally equivalent, while 1 indicates an assertion that they are. Note that
this is not a probability distribution. (The discussion of deriving such a confidence score
from data follows in later sections.)

A desirable property of such a score is that a shingle $e$ (in $\mathcal{L}_1$) may hold translational
equivalence with any number of shingles $f$ (in $\mathcal{L}_2$). For example, the English unigram “the”
is generally considered to hold a high degree of equivalence with the French unigrams “le”, “la”,
“les”, and “l’”. One-to-many (and many-to-many) relationships need not affect strengths of
association when dealing with confidence scores.

Such confidence scores, however, do not yield the information needed to determine the
probability distribution over language-independent concepts as required by the sim measure.
Rather than attempting to develop a generative model for entirely unobservable concepts,
I utilize the [BGMZ97] measure of resemblance as an indicator of text similarity.

Resemblance, however, comes with its own difficulties. At first blush, it appears that one
must define operations of set-union and set-intersection over language-independent shingles.
This is, however, not the case. The operations need not be defined at all; it is only the
cardinality of the sets that is of interest. Given the confidence scores, we may estimate the
cardinality of a set-intersection based on the confidence that each element in one set shares
a concept with any element in the other set.

^3 This discussion ignores entirely the problems of polysemy and context. A word that is polysemous
has multiple meanings (usually related, such as English “chicken” the animal and “chicken” the food). In
different contexts, words which are, by my definition, “translationally equivalent,” may not have the same
meaning at all. I assume that the cases where incorrect conclusions are drawn about two tokens’ translational
equivalence, due to lack of attention to context, will add only minor noise. This is an area requiring further
consideration in the future.

I use the notation E and F to refer to two text samples in $\mathcal{L}_1$ and $\mathcal{L}_2$ respectively whose
similarity we seek to estimate. Let S(E) and S(F) be the (language-specific) shingle sets
for the two text samples. Hence if:

E = “Philip drinks coffee with sugar”

F = “Philippe boit du café avec du sucre”

then for trigram shingles:

S(E) = {“Philip drinks coffee”, “drinks coffee with”, ...}

S(F) = {“Philippe boit du”, “boit du café”, ...}

Unigram shingle sets of E and F are:

S(E) = {“Philip”, “drinks”, “coffee”, “with”, “sugar”}

S(F) = {“Philippe”, “boit”, “du”, “café”, “avec”, “sucre”}

For a shingle set $S(x)$, denote the set of underlying concepts lexicalized in the set by
$\lambda(S(x))$. Next, define the function $\cap^{\dagger}(S(E), S(F))$ as the set of concepts lexicalized in both
E and F. This is equivalent to $\lambda(S(E)) \cap \lambda(S(F))$. Likewise, $\cup^{\dagger}(S(E), S(F))$ is the set
of concepts lexicalized in either E or F (equivalent to $\lambda(S(E)) \cup \lambda(S(F))$). I emphasize
that this approach does not seek to estimate the contents of either set. Because the $\cap^{\dagger}$ and
$\cup^{\dagger}$ functions are so similar to normal set intersection and union, I use the notation
$S(E) \cap^{\dagger} S(F)$ and $S(E) \cup^{\dagger} S(F)$. Note next that we may define $\cup^{\dagger}$ in terms of $\cap^{\dagger}$:

\[ |S(E) \cup^{\dagger} S(F)| = |\lambda(S(E))| + |\lambda(S(F))| - |S(E) \cap^{\dagger} S(F)| \tag{8} \]

This is analogous to an intuition noted in the previous section about the standard $\cup$ and $\cap$
functions over sets.

Before defining $|S(E) \cap^{\dagger} S(F)|$, I note the intuition that:

\[ |S(E) \cap^{\dagger} S(F)| \le \min\left(|\lambda(S(E))|,\ |\lambda(S(F))|\right) \tag{9} \]

That is, the intersection may be no greater in cardinality than either of the argument sets.
I assume that $|\lambda(S(x))| = |S(x)|$, that is, that exactly one unique concept is lexicalized by
each shingle in the set.

Finally, define the notation $e = f$ as the translational equivalence between shingle $e$ and
shingle $f$. In this discussion, the generic notation for the confidence value of the truth of
“$e = f$” will be $t(e, f)$.

Consider first a simple case where $|S(E)| = 1$ and $|S(F)| = 1$. In this scenario,
$|S(E) \cap^{\dagger} S(F)|$ is 1 if $e = f$ and 0 otherwise, where $e$ is the sole shingle in S(E) and $f$ the sole
shingle in S(F). Therefore, whatever confidence we have for the truth of “$e = f$” (i.e.,
$t(e, f)$) will be the estimate for $|S(E) \cap^{\dagger} S(F)|$.
The presence of $e$ in S(F) is the degree of confidence that $e = f$ for any $f \in S(F)$. I consider
this confidence score to be additive; if $t(e, f_1) = 0.3$ and $t(e, f_2) = 0.2$ then
$|\{e\} \cap^{\dagger} \{f_1, f_2\}|$ is 0.5. This becomes problematic when the summation of confidence values
is greater than one. Confidence must fall in the domain [0, 1], and further, a value greater than
$|S(E)| = 1$ is inconsistent with the intuition in (9). For this reason, I define the presence $\pi$
of $e$ in S(F) as follows:

\[ \pi[e, S(F)] = \min\left\{ 1,\ \sum_{f \in S(F)} t(e, f) \right\} \tag{10} \]


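Finally, the general case definition of $|S(E) \cap^{\dagger} S(F)|$ is:

\[
|S(E) \cap^{\dagger} S(F)|
  = \min\left\{ \sum_{e \in S(E)} \pi[e, S(F)],\ \sum_{f \in S(F)} \pi[f, S(E)] \right\}
  = \min\left\{ \sum_{e \in S(E)} \min\Big\{ 1, \sum_{f \in S(F)} t(e, f) \Big\},\
                \sum_{f \in S(F)} \min\Big\{ 1, \sum_{e \in S(E)} t(e, f) \Big\} \right\}
\tag{11}
\]

The outermost minimum enforces the intuition stated in (9).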

We now redefine the resemblance score r using $\cap^{\dagger}$ and $\cup^{\dagger}$:

\[ r(D_1, D_2) = \frac{|S(D_1) \cap^{\dagger} S(D_2)|}{|S(D_1) \cup^{\dagger} S(D_2)|} \tag{12} \]

An illustrative example is given in Table 1 and Figure 1.

                         e
    f           Philip   doesn’t   drink   tea
    Philippe      1         0        0      0
    ne            0         1        0      0
    boit          0         0        1      0
    pas           0         1        0      0
    de            0         0        0      0
    thé           0         0        0      1

Table 1: A sample boolean t function.


Further, we might also redefine sim using notions defined in this section:

\[ \mathrm{sim}(D_1, D_2) = \frac{2 \times \sum_{s \in S(D_1) \cap^{\dagger} S(D_2)} \log \Pr(s)}{\sum_{s \in \lambda(S(D_1))} \log \Pr(s) + \sum_{s \in \lambda(S(D_2))} \log \Pr(s)} \tag{13} \]


It was previously noted that this score requires probability distributions over shingles that
might be in different languages. By using language-independent concepts instead of the
shingles themselves, the problem of computing the intersection between two text samples’
shingles is avoided. A new problem presents itself, however: how can we estimate a
probability model over unobservable language-independent concepts? Seeking to define, let alone
model, these concepts goes against the intuition that the best linguistic representation of
a sample is the text string itself.

\begin{align*}
\pi[\text{Philip}, S(F)]   &= \min(1,\ 1+0+0+0+0+0) = 1 \\
\pi[\text{doesn't}, S(F)]  &= \min(1,\ 0+1+0+1+0+0) = 1 \\
\pi[\text{drink}, S(F)]    &= \min(1,\ 0+0+1+0+0+0) = 1 \\
\pi[\text{tea}, S(F)]      &= \min(1,\ 0+0+0+0+0+1) = 1 \\
\pi[\text{Philippe}, S(E)] &= \min(1,\ 1+0+0+0) = 1 \\
\pi[\text{ne}, S(E)]       &= \min(1,\ 0+1+0+0) = 1 \\
\pi[\text{boit}, S(E)]     &= \min(1,\ 0+0+1+0) = 1 \\
\pi[\text{pas}, S(E)]      &= \min(1,\ 0+1+0+0) = 1 \\
\pi[\text{de}, S(E)]       &= \min(1,\ 0+0+0+0) = 0 \\
\pi[\text{thé}, S(E)]      &= \min(1,\ 0+0+0+1) = 1 \\
\sum_{e \in S(E)} \pi[e, S(F)] &= 4 \\
\sum_{f \in S(F)} \pi[f, S(E)] &= 5 \\
|S(E) \cap^{\dagger} S(F)| &= \min(4, 5) = 4 \\
|S(E) \cup^{\dagger} S(F)| &= |\lambda(S(E))| + |\lambda(S(F))| - 4 = 4 + 6 - 4 = 6 \\
r &= \frac{4}{6} = \frac{2}{3}
\end{align*}

Figure 1: An example of computing the r score. The shingles here are unigrams. The
values of $t(e, f)$ come from Table 1. The text samples are “Philip doesn’t drink tea” and
“Philippe ne boit pas de thé.”
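The computation in Figure 1 can be reproduced with a short sketch of equations (8) and (10) through (12). The function names are mine; the t values are the boolean entries of Table 1, and the assumption that each shingle lexicalizes exactly one concept is taken from the text.

```python
def presence(x, others, t):
    """Equation (10): summed confidence that x matches any item in
    `others`, capped at 1."""
    return min(1.0, sum(t(x, y) for y in others))


def cross_resemblance(s_e, s_f, t):
    """Equations (8), (11) and (12) for shingle sets s_e and s_f, given a
    confidence function t(e, f)."""
    from_e = sum(presence(e, s_f, t) for e in s_e)
    # For the F side the argument order of t is swapped, since t takes (e, f).
    from_f = sum(presence(f, s_e, lambda f_, e_: t(e_, f_)) for f in s_f)
    inter = min(from_e, from_f)            # equation (11)
    union = len(s_e) + len(s_f) - inter    # equation (8), with |lambda(S)| = |S|
    return inter / union if union else 0.0


# Non-zero entries of Table 1, stored as (e, f) pairs.
TABLE_1 = {("Philip", "Philippe"), ("doesn't", "ne"), ("doesn't", "pas"),
           ("drink", "boit"), ("tea", "thé")}


def t(e, f):
    return 1.0 if (e, f) in TABLE_1 else 0.0


s_e = {"Philip", "doesn't", "drink", "tea"}
s_f = {"Philippe", "ne", "boit", "pas", "de", "thé"}
r = cross_resemblance(s_e, s_f, t)   # 4 / 6, as in Figure 1
```

Nothing in the sketch requires t to be boolean; any function returning values in [0, 1] can be passed in.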

The newly-defined resemblance score allows us to avoid explicitly estimating the concepts
present in a text sample because it deals only with cardinalities of sets. Given a word-to-word
translational equivalence function (generically, t), we can compute r for pairs of text
strings without ever resorting to artificial linguistic representations. Some t functions on
which these definitions might be based will be addressed in Section 3.

3  Word-to-Word  Equivalence  Functions 

In order to use the similarity measure described in Section 2 to estimate the degree of
translational equivalence between a bilingual pair of text samples, we must define a measure
of translational equivalence between shingles in the two languages.

I consider here shingles of length one (i.e., unigrams). This approach has several advantages:

•  It affords the fewest problems with sparse data.

•  It is likely to closely mirror actual translational equivalence between text items, since
the majority of empirically observable correspondences are one (word) to one (word)
[Mel00].

•  It fits most cleanly with available data sources.

•  It fits most cleanly with current translation models.

I consider three informing functions which may be used as an estimate of the = function:
electronic a priori bilingual dictionaries, automatically-learned probabilistic word-to-word
translation models, and cognate-ness scores. The first two are described in this section; a
discussion of cognates is in Section 4.

3.1  Bilingual  Dictionaries 

For some language pairs, electronic bilingual dictionaries are available. Previous work (e.g.,
[BD93]) has viewed such resources as exploitable data, and they are particularly appropriate
for the task of identifying translational equivalence.

A bilingual dictionary may be adapted to suit this purpose in a straightforward way.
Let t(e, f) be a boolean predicate such that t(e, f) = 1 if the entry (e, f) is present in the
dictionary and 0 otherwise. A comprehensive dictionary would be expected to list most
translationally equivalent terms, though a few shortcomings are to be expected:

•  Many bilingual dictionary entries will not be one-to-one. How to exploit these entries
in a reasonable way when the element of interest is the unigram is an open question.

•  Bilingual dictionaries may not contain domain-specific words and terms; such terms
are often highly informative.

•  Morphological variants are not typically listed in dictionaries, so without lemmatization
of the text samples, some potentially informative open class terms will not be
found in the dictionary.
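The dictionary-based predicate, and the third shortcoming in particular, can be sketched as follows. The entries and the toy lemma table are invented for illustration; `normalize` is a hypothetical hook standing in for whatever lemmatizer is available.

```python
def make_dictionary_t(entries, normalize=str.lower):
    """Build a boolean t(e, f) from (e, f) dictionary entries. `normalize`
    is applied to both the entries and the queried words."""
    table = {(normalize(e), normalize(f)) for e, f in entries}

    def t(e, f):
        return 1 if (normalize(e), normalize(f)) in table else 0

    return t


# Hypothetical one-to-one entries; real dictionaries list citation forms.
entries = [("tea", "thé"), ("drink", "boire")]

t_plain = make_dictionary_t(entries)

# Toy lemma table standing in for a real lemmatizer (an assumption).
LEMMAS = {"boit": "boire", "drinks": "drink"}


def toy_lemma(word):
    word = word.lower()
    return LEMMAS.get(word, word)


t_lemma = make_dictionary_t(entries, normalize=toy_lemma)

missed = t_plain("drinks", "boit")   # inflected forms are not in the dictionary
found = t_lemma("drinks", "boit")    # recovered once both sides are lemmatized
```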




3.2  Translation  Models 


It has been shown that performance on multilingual tasks involving word-level translational
equivalence (e.g., cross-language information retrieval) can benefit from the use of
automatically induced translation lexicons [ROL01], [NSID99].

For this purpose, I utilize Melamed’s [Mel00] Model A. Model A assumes a generative
process in which pairs of lexical items are generated in turn according to a distribution
Pr(e, f), producing parallel bags of words. The parameters to this model are learned from
a corpus of parallel text, aligned at approximately the sentence level.

For a pair of words e and f, the value of Pr(e, f) could as well be taken as an estimator
t of e = f. This relies on the assumption that the probability of generating a pair of
type (e, f) is directly related to the confidence that e and f are translationally equivalent.
Alternately, any non-zero entry (e, f) in the translation model might be assigned t(e, f) = 1,
and for any word pair (e, f) for which Pr(e, f) = 0, t(e, f) = 0. This would create a boolean
t function that might be merged with other such resources.^4
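The two ways of deriving t from a translation model, and the merge with another boolean resource, might look like the sketch below. The model is represented as a plain dictionary from word pairs to probabilities, and its entries are invented; it is not output from Model A.

```python
def weighted_t(model):
    """Use Pr(e, f) itself as the confidence value."""
    return lambda e, f: model.get((e, f), 0.0)


def boolean_t(model):
    """Assign 1 to every pair with non-zero probability, 0 otherwise."""
    return lambda e, f: 1 if model.get((e, f), 0.0) > 0 else 0


def merge_boolean(*t_functions):
    """Combine boolean t functions: a pair is accepted if any resource lists it."""
    return lambda e, f: max(t(e, f) for t in t_functions)


# Hypothetical joint probabilities, for illustration only.
model = {("the", "le"): 0.02, ("the", "la"): 0.015, ("tea", "thé"): 0.001}
# Hypothetical dictionary entries.
dictionary = {("drink", "boire")}

t_w = weighted_t(model)
t_b = boolean_t(model)
t_dict = lambda e, f: 1 if (e, f) in dictionary else 0
t_merged = merge_boolean(t_b, t_dict)
```

The boolean version discards the relative weights, which is what allows it to be combined with a dictionary on equal terms.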

Using a learned statistical translation model helps to overcome the problems with
manually-constructed dictionaries:

•  Model A assumes generation of concepts, which are assumed to be in a one-to-one
relationship with word-pairs. Therefore, for any Pr(e, f), e and f are both unigrams.

•  The coverage of Model A is determined by the domain of the training corpus.

•  Morphological variation is unknown to Model A; the distribution ranges over lexeme
pairs.

It is important to highlight that translation models do require a resource: aligned parallel
text. In general, this type of resource is more readily available than broad-domain electronic
dictionaries (e.g., the Bible exists in electronic form in nearly every language for which any
electronic resources are available), though its preparation is sometimes non-trivial.

One can imagine a framework in which a highly precise parallel text is obtained using
the STRAND system [Res99], then used to train a translation model. This model could then
be used to identify additional text samples that are translationally equivalent. By adding
these samples to the parallel corpus in an iterative manner, retraining the translation model
at each step, a larger parallel corpus might be extracted from the set of candidates. This
sort of framework is the motivation for the evaluation task described in Section 5, though
I did not undertake its exploration.

4  Cognates 

The term “cognate” refers to a word in one language that is orthographically or phonetically
similar to a semantically related word in another language. An example is the cognate
pair English “calendar” and French “calendrier.” By identifying cognates in a pair of
text samples, we may hope to more accurately detect translational equivalence. The key
advantage to cognates over dictionary and translation model^5 methods is that the detection
of cognates need not rely on having previously seen the pair of words in translationally
equivalent contexts prior to the detection task, or even having seen either word at all.

^4 Merging with a dictionary was the motivation for this approach, but experimental results in Section 7
show that this t function can outperform the weighted version.

^5 Melamed and Smith [MS00] have designed and implemented a generative translation model which
includes cognate information.

Previous work has shown the usefulness of cognates. Simard et al. [SFI92] used cognates
to improve sentence alignment in parallel corpora. Knight and Graehl [KG97] explored
transliteration of English characters to Japanese katakana using a generative source-channel
model; the performance their system attained was better than that of human judges.
Melamed [Mel95] used a string similarity measure based on character identity and string
length to identify cognates; this was extended by Tiedemann [Tie99]. Work by Smith and
Jahr [WS99] showed that cognate classifiers like Tiedemann’s could be learned from a bilingual
dictionary, and that these classifiers could improve translation models like that in the
Candide system [BB94]. This approach is discussed and generalized here.^6

4.1  Background 

String similarity metrics have proven useful in the extraction of cognates from text. This
section describes generalizations on a method presented in Tiedemann [Tie99], which
automatically constructed language-specific string matching functions from a set of known
cognate pairs. My approach constructs such a function from a sentence-aligned parallel
corpus, and it assumes very little linguistic similarity between the two languages.^7

4.1.1  LCSR  and  HSCR 

Tiedemann [Tie99] explored ways in which language-independent versions of the Longest
Common Subsequence Ratio (LCSR, [Mel95]) could be derived from a set of known cognates in
the language pair of interest. The LCSR is the ratio of the length of the longest common
subsequence (LCS) of characters in the two types in the pair (the subsequence need not be
contiguous) to the length (in characters) of the longer word in the pair.

The LCSR can be calculated using a dynamic programming technique. This is best
illustrated by an example (based on [Tie99]). Consider the English word seismic (E) and
its Czech cognate seismicky (C). Note that the lengths, respectively, are seven characters
and nine characters. Let $l_{i,j}$ denote the length of the longest common subsequence (LCS) of the
first i characters of seismic and the first j characters of seismicky. For some
character-matching function m, the dynamic programming equations are as follows:

\[ \forall i,\ 0 \le i \le 7: \quad l_{i,0} = 0 \]
\[ \forall j,\ 0 \le j \le 9: \quad l_{0,j} = 0 \]
\[ l_{i,j} = \max\left[\, l_{i-1,j},\ l_{i,j-1},\ l_{i-1,j-1} + m(E_i, C_j) \,\right] \]

(The m function is boolean for the LCSR: it is 1 if the argument characters are the
same and 0 if they are not.) The LCSR is then computed by dividing $l_{7,9}$ by
max[|seismic|, |seismicky|]. The algorithm is illustrated in the following matrix, where the
rows correspond to values of i and the columns to values of j:

          s  e  i  s  m  i  c  k  y
      s   1  1  1  1  1  1  1  1  1
      e   1  2  2  2  2  2  2  2  2
      i   1  2  3  3  3  3  3  3  3
      s   1  2  3  4  4  4  4  4  4
      m   1  2  3  4  5  5  5  5  5
      i   1  2  3  4  5  6  6  6  6
      c   1  2  3  4  5  6  7  7  7

⁶ While this approach may appear to be overkill for languages that use the same script in similar ways,
like English and French (the languages it was tested on in Section 7.5), this technique was developed with
an eye toward application to dissimilar orthographic systems, e.g., Arabic, Greek, Cyrillic, and Hangul. For
language pairs like English and French, LCSR might yield acceptable (or even better) results.

⁷ Development of the techniques presented here was carried out in part using computing resources at the
University of Edinburgh, Edinburgh, Scotland.
The LCS is 7; the LCSR is 7/9. For languages that have similar orthographic systems, like
Swedish and English, this is an effective way of recognizing cognates.
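
The recurrence can be checked mechanically. The following is a minimal sketch, not the code used for the experiments reported here; the function names lcs_table and ratio are hypothetical. The matching function m is a parameter, so the same routine yields the LCS length with the boolean m and a highest-score-of-correspondence style value when a weighted m over single characters is supplied.

```python
def lcs_table(a, b, m=None):
    """Fill l[i][j] for all prefixes of a and b using the recurrence above."""
    if m is None:
        # Boolean matching function of the plain LCSR: 1 for identical characters.
        m = lambda x, y: 1 if x == y else 0
    # Row 0 and column 0 stay at 0 (the two boundary conditions).
    l = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            l[i][j] = max(l[i - 1][j],
                          l[i][j - 1],
                          l[i - 1][j - 1] + m(a[i - 1], b[j - 1]))
    return l


def ratio(a, b, m=None):
    """Final table entry divided by the length of the longer string."""
    if not a or not b:
        return 0.0
    return lcs_table(a, b, m)[len(a)][len(b)] / max(len(a), len(b))


table = lcs_table("seismic", "seismicky")
print(table[7][9])                     # length of the LCS
print(ratio("seismic", "seismicky"))   # LCS length divided by 9
```

Only single characters are treated as units in this sketch; n-gram units would require a different segmentation step before the table is filled.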

In  [Tie99],  three  types  of  independent  character  matching  functions  were  suggested  for 
m,  so  that  instead  of  computing  the  LCS  length,  the  algorithm  computes  the  highest  score  of 
correspondence  (HSC).  These  independent  matching  functions  were  generated  empirically 
based on a list of known cognates. Each function m was defined for each pair of units, with
a  unit  being  either  a  character  or  an  n-gram  of  characters.  The  HSC  ratio  (HSCR)  for  two 
words  is  computed  by  dividing  the  HSC  by  the  maximum  length  of  the  two  words,  in  units. 

HSCR  is  then  taken  to  be  a  score  of  cognate-ness  for  a  pair  of  words  in  two  languages. 

4.2  Modifications  for  this  Application 

Rather  than  a  list  of  known  cognate  pairs  (an  unlikely  resource),  I  used  as  a  training  set 
the  non-zero  entries  in  a  statistical  translation  model  (see  Section  3.2)  trained  on  aligned 
parallel bi-text. These pairs are weighted with scores (probabilities) in [0, 1] that are taken
to indicate confidence in translational equivalence. Unlike Tiedemann, I exploit these scores
in  estimating  the  matching  function  m.  The  details  of  the  training  algorithm  for  m  are  as 
follows. 


4.2.1  Filters 

Tiedemann [Tie99] utilized several filters on bilingual word pairs (found in a corpus) which
might be scored for similarity. Two of these filters may be generalized so that minimal
assumptions are made about the language pair in question.

The first filter is a minimal token length, which Tiedemann set to 4 for both languages. If
a language like Japanese were being considered, where a single atomic character may contain
several phonemes, this would be inappropriate when the other language is written in an
alphabetic script. Preferably, the minimal length would be customized for each language.
The remedy I used was to set the minimum at 4 for English (E below) and estimate a
minimum for the other language L based on each language’s average type length:


$\text{minlength}_L = \dfrac{\overline{\text{typeLength}}_L}{\overline{\text{typeLength}}_E} \cdot \text{minlength}_E$    (14)


The other constraint of interest is a minimal length difference ratio. Tiedemann [Tie99]
argues that, because “cognates should be of comparable length,” the ratio of the shorter
string’s length to the longer string’s length should be required to be above a certain value,
which he set at 0.7. The approach here does the same, but normalizes each string’s length
by the average type length for the types in the string’s language. The ratio of the normalized
lengths is then required to be in the range $[\frac{7}{10}, \frac{10}{7}]$.
The score for any word pair which does not meet these two criteria is zero, and any
word pair in the training list which does not meet the criteria is not considered.
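
As an illustration, the two filters might be sketched as follows. This is a sketch under stated assumptions rather than the code used here: the function names are hypothetical, word length is measured in characters, and the default bounds lo and hi on the normalized length ratio are assumed values that mirror Tiedemann’s 0.7 and its reciprocal.

```python
def min_length_for(avg_type_len_l, avg_type_len_e, min_len_e=4):
    """Scale the English minimum length by the ratio of average type lengths."""
    return avg_type_len_l / avg_type_len_e * min_len_e


def passes_filters(e, f, avg_type_len_e, avg_type_len_l,
                   min_len_e=4, lo=0.7, hi=1 / 0.7):
    """True if the pair (e, f) survives both length filters.

    lo and hi are assumed bounds on the ratio of normalized lengths.
    """
    min_len_l = min_length_for(avg_type_len_l, avg_type_len_e, min_len_e)
    if len(e) < min_len_e or len(f) < min_len_l:
        return False  # first filter: a word is too short
    # Second filter: compare lengths after normalizing by average type length.
    norm_ratio = (len(e) / avg_type_len_e) / (len(f) / avg_type_len_l)
    return lo <= norm_ratio <= hi


# Hypothetical average type lengths of 7.0 (English) and 9.0 (other language).
print(passes_filters("seismic", "seismicky", 7.0, 9.0))
print(passes_filters("sea", "seismicky", 7.0, 9.0))
```

A pair that fails either test would receive a score of zero and would be dropped from the training list.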


4.2.2  Counting  character  cooccurrences 

Counts of cooccurring characters from all of the training pairs are then taken. [Tie99]
used an “estimated position” value to determine which characters should be counted, for
example:

)  =  round 

This  assumes  a  linear  relationship  between  the  characters  in  the  strings,  but  may  be  overly 
strong  when  the  training  example  is  like  the  seismic / seismicky  example  above,  in  which 
the  term  in  one  language  contains  an  affix.  Instead  of  limiting  the  character  pairs  that 
were  counted,  my  approach  biases  in  favor  of  those  characters  which  are  in  similar  positions 
relative  to  the  beginning  and  end  of  the  string: 


$\text{count}(E_i, C_j) = 1 - \left| \dfrac{i - \frac{1}{2}}{\text{length}(E)} - \dfrac{j - \frac{1}{2}}{\text{length}(C)} \right|$    (16)


i and j are decreased by 1/2 so as to place the characters between the integers {0, 1, ..., n}
for an n-length type; this means that the relative positions of all characters are influenced
by the length of the string containing them (i.e., the first character is not at the absolute
beginning of the string and the last is not at the absolute end). Hence in my method, for
a given training word pair, all pairs in the cross-product of characters are counted, but the
counts are weighted by the relative nearness of the positions.

Each  count  is  further  weighted  by  the  score  (probability)  of  the  word  pair  from  which 
it  comes.  Therefore,  low-scoring  word  pairs  from  the  translation  model  do  not  impact  the 
character  matching  function  as  greatly  as  high-scoring  word  pairs. 
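
The counting scheme can be sketched as below. This is an illustrative sketch only: the names position_weight and count_cooccurrences are hypothetical, and the training pairs are assumed to arrive as (word, word, score) triples taken from the translation model.

```python
from collections import defaultdict


def position_weight(i, j, len_e, len_c):
    """Weight for 1-based character positions i and j.

    It equals 1 when the two characters sit at the same relative position
    and decreases as their relative positions move apart.
    """
    return 1.0 - abs((i - 0.5) / len_e - (j - 0.5) / len_c)


def count_cooccurrences(training_pairs):
    """Accumulate weighted counts over the full cross-product of characters.

    training_pairs: iterable of (e, c, score), where score is the word
    pair's probability from the translation model.
    """
    counts = defaultdict(float)
    for e, c, score in training_pairs:
        for i, x in enumerate(e, start=1):
            for j, y in enumerate(c, start=1):
                counts[(x, y)] += score * position_weight(i, j, len(e), len(c))
    return counts


counts = count_cooccurrences([("ab", "ab", 0.5)])
print(counts[("a", "a")], counts[("a", "b")])
```

In the toy pair, the aligned characters receive the full word-pair score, while the crossed pair ("a", "b") receives half of it because the relative positions differ by 0.5.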


4.2.3  Character  classes 

Tiedemann [Tie99] used a further restriction on counting in this particular method: vowels
were counted only with vowels and consonants only with consonants. Each character was
assigned to one of these disjoint sets. While this was quite helpful for Swedish-English, it
would be inappropriate for two languages which use the letters differently (e.g., w is a vowel
in Welsh, but a consonant in English), or which use different letters (e.g., Russian-English).
Further,  in  some  scenarios,  the  script  may  be  entirely  unknown,  so  that  even  if  such  classes 
(and  bilingual  correspondences  between  classes)  exist,  the  information  is  unavailable.  My 
approach  assumes  no  prior  knowledge  of  character  classes  which  might  help  in  the  task  of 
building  a  character  matching  function. 

4.2.4  Matching  function  m 

Like  Tiedemann  [Tie99],  this  approach  computes  a  Dice  score  for  each  character  pair  after 
collecting  the  counts.  The  equation  is  given  below.  This  score  is  then  used  as  the  m 
function. 


$m(x, y) = \text{Dice}(x, y) = \dfrac{2 \cdot \text{count}(x, y)}{\sum_{y'} \text{count}(x, y') + \sum_{x'} \text{count}(x', y)}$    (17)
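
A Dice score over the weighted counts might be computed as in the following sketch. The marginal totals are assumed here to be the sums of the pair counts over the other character; the function name dice_matching_function is hypothetical.

```python
from collections import defaultdict


def dice_matching_function(counts):
    """Build m(x, y) from a dict mapping (x, y) to a weighted count."""
    total_x = defaultdict(float)  # marginal count of each first-language character
    total_y = defaultdict(float)  # marginal count of each second-language character
    for (x, y), c in counts.items():
        total_x[x] += c
        total_y[y] += c

    def m(x, y):
        denom = total_x.get(x, 0.0) + total_y.get(y, 0.0)
        if denom == 0.0:
            return 0.0  # neither character was seen in training
        return 2.0 * counts.get((x, y), 0.0) / denom

    return m


toy_counts = {("a", "a"): 0.5, ("a", "b"): 0.25,
              ("b", "a"): 0.25, ("b", "b"): 0.5}
m = dice_matching_function(toy_counts)
print(m("a", "a"), m("a", "b"))
```

The resulting m takes values in [0, 1] and can replace the boolean matching function in the dynamic programming recurrence.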
4.2.5 Cognateness as t

The HSCR may ultimately be used as a t function, i.e., t(e, f) = HSCR(e, f). Low HSCR
scores, however, are uninformative, since the score is really intended to separate cognate
pairs from the vast majority of word pairs that are not cognates. For this reason, and
because every pair e, f has an HSCR score (and for e and f above the minimum length,
the score is non-zero), it makes sense to apply a threshold τ, so that t(e, f) = HSCR(e, f)
if HSCR(e, f) > τ and 0 otherwise. Similarly, the value of the HSCR could be ignored
altogether to create a boolean t function: t(e, f) = 1 if HSCR(e, f) > τ and 0 otherwise.
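
Both thresholded variants can be expressed with one small wrapper. This is a sketch only: make_t is a hypothetical name, tau stands for the threshold, and the hscr argument is any function returning an HSCR-style score (a toy stand-in is used below).

```python
def make_t(hscr, tau, boolean=False):
    """Wrap a scoring function into a thresholded t function."""
    def t(e, f):
        score = hscr(e, f)
        if score > tau:
            return 1.0 if boolean else score
        return 0.0
    return t


# Toy stand-in for a trained HSCR function (hypothetical values).
def toy_hscr(e, f):
    return 0.8 if (e, f) == ("seismic", "seismicky") else 0.2


t_fuzzy = make_t(toy_hscr, tau=0.5)
t_bool = make_t(toy_hscr, tau=0.5, boolean=True)
print(t_fuzzy("seismic", "seismicky"), t_bool("seismic", "seismicky"))
```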

5  Evaluation  Task 

I propose the following task to evaluate the usefulness of a given t function in detecting
translational equivalence. Begin with a segmented corpus $C_1$ in language $L_1$ and another
segmented corpus $C_2$ in language $L_2$. Let both $C_1$ and $C_2$ consist of n segments. Let k of the
segments in $C_1$ be known to be translationally equivalent respectively to k segments in $C_2$.
The remaining n − k segments in $C_1$ and $C_2$ are noise and assumed not to be translationally
equivalent to any segments in the opposite corpus. It is assumed, therefore, that every
segment in either corpus has either one or zero translationally equivalent elements in the
other corpus.

Using the t function as an estimate of shingle translational equivalence (=), a resemblance
score may be computed for each of the n × n potentially translationally equivalent
pairs in $C_1 \times C_2$. Resemblance for this purpose exploits the set-theoretic functions defined
in Section 2:

$r(E, F) = \dfrac{|S(E) \cap S(F)|}{|S(E) \cup S(F)|}$    (18)

where $E \in C_1$, $F \in C_2$ and S(E), S(F) are the sets of unigram types (n-length shingles)
present in E and F, respectively.

The problem is now reducible to the maximum weighted bipartite matching problem
[Mel00]: given a bipartite graph G(V, E) (let $V = C_1 + C_2$) with weighted edges (let
w(E, F) = r(E, F) for all E, F), find a matching M between the bipartite sets that
maximizes $\sum_{(E,F) \in M} w(E, F)$. The lowest currently known upper bound on the computational
complexity of this problem is $O(ve + v^2 \log v)$ for v vertices and e edges [AMO93]. This
reduces to $O(n^3)$ for this task, since all n × n pairs are scored, creating $n^2$ edges.

Using  the  maximum  weighted  bipartite  matching  algorithm  is  reasonable  when  the  sets 
of text samples are small. I report results using this algorithm (see discussion of precision
and  recall  below). 

Following Melamed [Mel00], I also utilize a greedy approximation algorithm to maximum
weighted bipartite matching called competitive linking. This technique operates in the
following manner. Begin with the set of candidate pairs $C_1 \times C_2$. At each step, select the
highest-scored pair ($E \in C_1$, $F \in C_2$)⁸. Mark (E, F) as a translationally equivalent pair
and remove it from the set of candidates (along with every other candidate involving E or F,
since each segment may be linked at most once), adding it to T, a set of pairs believed to be
translationally equivalent. Continue until some stopping condition is met or no more links
are possible. When the scored pairs are maintained in a priority queue implemented by a
heap, this algorithm runs in O(n log n); it is suitable for scenarios where n is large or only
the top pairs are desired.

⁸ Melamed and Smith (in progress) describe a random tie-breaking method when multiple candidates have
the same score, which I use.

At each step in the competitive linking algorithm, precision and recall scores may be
calculated for the set of marked pairs (E, F). Let $T_c$ be the subset of marked pairs (E, F) ∈
T where E and F are known in advance to be translationally equivalent.

$\text{precision} = \dfrac{|T_c|}{|T|}$    (19)

$\text{recall} = \dfrac{|T_c|}{k}$    (20)

I report precision and recall at all stages of completion of the algorithm using
precision-recall plots.
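
The linking procedure and the two measures can be sketched as follows. This is a simplified illustration, not the implementation used in the experiments: candidates are sorted once instead of being kept in a heap, ties are broken by sort order rather than at random, and the names competitive_linking and precision_recall_trace are hypothetical.

```python
def competitive_linking(scores):
    """Greedy one-to-one linking.

    scores: dict mapping (E, F) identifiers to a resemblance score.
    Returns the linked pairs in the order in which they were made.
    """
    linked_e, linked_f, links = set(), set(), []
    # Visit candidates from highest to lowest score.
    for (e, f), _ in sorted(scores.items(), key=lambda kv: -kv[1]):
        if e not in linked_e and f not in linked_f:
            links.append((e, f))
            linked_e.add(e)
            linked_f.add(f)
    return links


def precision_recall_trace(links, gold, k):
    """Precision and recall after each link; gold holds the k known pairs."""
    trace, correct = [], 0
    for step, pair in enumerate(links, start=1):
        if pair in gold:
            correct += 1
        trace.append((correct / step, correct / k))
    return trace


scores = {("e1", "f1"): 0.9, ("e1", "f2"): 0.8,
          ("e2", "f1"): 0.7, ("e2", "f2"): 0.1}
links = competitive_linking(scores)
print(links)
print(precision_recall_trace(links, {("e1", "f1"), ("e2", "f2")}, k=2))
```

In this toy example the greedy links have total weight 1.0, whereas the alternative matching (e1, f2), (e2, f1) has total weight 1.5, which illustrates how competitive linking can differ from a maximum weighted matching.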


6  Shrinking  the  Search  Space 

Scoring a text sample pair (E, F) for resemblance requires O(|E| · |F|) steps. Exhaustive
pairwise scoring, then, is a highly expensive endeavor. Chen and Nie [CN00] note that
pairs (E, F) that are highly disparate in length are unlikely to be translational pairs. In
other words, a positive correlation between the lengths of E and F where E and F are
translationally equivalent is to be expected.

If this is the case, the space of pairs that must be scored may be significantly narrowed by
filtering out pairs where, e.g., E is relatively short and F is relatively long, or vice versa.

I trained a linear regression model for sentences in an aligned parallel corpus of Hong
Kong Laws [Ma00] in English and Chinese. (Information about the training and test corpora
for this experiment is shown in Table 2.)

Woods et al. [WFH86] describe how to compute a confidence interval for an independent
variable value within such a model. For example, we might like to know with some level of
confidence the range of values for the length of the Chinese translation for a given English
sentence. Let p denote the probability that the Chinese translation will not be within that
range.

I applied the confidence interval as a filter to the set of candidates for resemblance
scoring. Given E (an English segment), the set of candidates involving E are only those
pairs (E, F) where |F| is in the confidence interval for |E|, for some p. A higher value of
p yields a more strict filter, eliminating pairs whose lengths are not close to the regression
line; the benefit of such a filter is a significantly reduced search space. A lower value of
p, however, is more conservative, sacrificing search space reduction in favor of potentially
higher recall. Such a p eliminates fewer pairs, reducing the chances that some
translationally equivalent pairs will be ruled out before resemblance scoring.
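
A length filter of this kind might be sketched as follows. The exact procedure of [WFH86] is not reproduced; the sketch assumes an ordinary least-squares fit of Chinese length on English length, the textbook prediction interval for a new observation, and a normal approximation to the t quantile. All function names and the toy data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean


def fit_length_model(pairs):
    """Least-squares fit of len_f on len_e from aligned (len_e, len_f) pairs."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    n, xbar, ybar = len(pairs), mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sum((x - xbar) * (y - ybar) for x, y in pairs) / sxx
    intercept = ybar - slope * xbar
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in pairs)
    s = sqrt(sse / (n - 2))  # residual standard error
    return {"n": n, "xbar": xbar, "sxx": sxx,
            "slope": slope, "intercept": intercept, "s": s}


def length_interval(model, len_e, p):
    """Range expected to contain len_f with probability 1 - p."""
    z = NormalDist().inv_cdf(1 - p / 2)  # assumed normal approximation
    centre = model["intercept"] + model["slope"] * len_e
    half = z * model["s"] * sqrt(
        1 + 1 / model["n"] + (len_e - model["xbar"]) ** 2 / model["sxx"])
    return centre - half, centre + half


def passes_length_filter(model, len_e, len_f, p):
    low, high = length_interval(model, len_e, p)
    return low <= len_f <= high


# Toy aligned segment lengths (hypothetical data).
model = fit_length_model([(10, 11), (20, 19), (30, 32), (40, 41), (50, 49)])
print(passes_length_filter(model, 30, 33, p=0.05))  # wide interval
print(passes_length_filter(model, 30, 33, p=0.5))   # narrow interval
```

Raising p narrows the interval, so the same candidate pair can pass a conservative filter and fail a strict one.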

The size of the unfiltered search space (i.e., the number of resemblance scores that must
be computed) for the test corpus is 191² = 36,481. Figure 2 shows the size of the search
space remaining after this length filter is applied for varying values of p. Figure 3 shows the
number of translationally equivalent pairs remaining in the filtered search space for varying
values of p.

Interestingly, as p increases, the search space size decays approximately exponentially,
while the number of correct pairs eliminated increases only linearly. This is encouraging; it
shows that computational savings are to be had without necessarily affecting performance
in direct relation to the savings.
          size        English  English  Chinese  Chinese  mean English  mean Chinese
          (segments)  tokens   types    tokens   types    segment size  segment size
training  5,622       204,388  5,612    208,163  4,517    32.44         33.04
test      191         6,192    1,026    6,326    951      30.96         31.63

Table 2: Hong Kong Laws parallel corpus [Ma00]. The English text was tokenized using
a script included in the Egypt distribution [WS99], written by Dan Melamed and Yaser
Al-Onaizan. The Chinese text was segmented by Clara Cabezas using ch_seg [Chen95]
[L094].


Figure 2: The benefit of length filter is a reduction in search space.

Figure 3: The length filter eliminates few correct pairs.


7  Experimental  Results 

A  number  of  experiments  with  the  methods  described  were  carried  out.  First,  comparisons 
were  made  among  different  t  functions,  most  notably  between  functions  constructed  from 
electronic  dictionaries  and  functions  derived  from  statistical  translation  models.  The  actual 
effects of the length filter (Section 6) are shown. Noise, in the form of additional text
samples  without  their  respective  translations,  was  added  to  the  test  corpus  and  evaluation 
carried  out.  The  usefulness  of  cognates  as  a  supplement  to  other  t  functions  was  tested  next. 
Finally,  preliminary  experiments  were  carried  out  on  document  pair  candidates  discovered 
by  the  STRAND  system. 

7.1  Human  vs.  Machine:  English-Chinese 

In this experiment, several different t functions were compared. Three of them were defined
as boolean functions using a bilingual dictionary as an information source. Another was
fuzzy (i.e., ranges over [0, 1]), defined using a statistical translation model learned from
parallel text [Mel00]. The training and test corpora used are those described in Table 2.

7.1.1  Electronic  bilingual  dictionary 

Starting with two directional lexicons⁹, three different t functions were induced. Recall that
the t function is defined (for this discussion) over pairs of unigram types. This dictionary,
in its original form, contained many entries involving more than one English word and/or
more than one Chinese word. The three approaches used here are the most obvious ways
of inducing the function from a dictionary.

To begin, the Chinese-to-English dictionary¹⁰ contained 341,187 entries, and the
English-to-Chinese dictionary¹¹ contained 394,969 entries.

⁹ Thanks to Clara Cabezas and Gina Levow for help in procuring these resources.

One-to-one  entries  (OO) 

1. All entries involving more than one English word or more than one Chinese word
were removed. The Chinese-to-English dictionary now contained 232,518 entries. The
English-to-Chinese dictionary now contained 288,709 entries.

2. The two lexicons were merged into a single listing. 291,454 pairs were in the union of
the two dictionaries.

3. Any entries including words not found in the corpora were removed (this step was carried
out for efficiency reasons; it could not have affected performance in any way). The
final dictionary contained 9,263 entries.

The $t_{OO}$ function is:

$t_{OO}(e, f) \Leftrightarrow (e, f) \in \text{Dictionary}$

Cross-product  of  non-one-to-one  entries  (CP) 

1.  All  non-one-to-one  entries  from  both  dictionaries  were  collected;  the  total  was  214,947 
entries. 

2. Any entries containing (any) words not present in the corpus were removed. The
remaining set contained 6,060 entries.

3. The entries were expanded into multiple entries by taking the cross-product of English
words with Chinese words. All elements in the cross-product were included; the result
was 15,549 entries. 5,910 of these entries were not present in the OO dictionary.

4. These entries were merged with the OO dictionary. The new dictionary totaled 15,173
one-to-one entries.

The $t_{CP}$ function is:

$t_{CP}(e, f) \Leftrightarrow (\ldots e \ldots, \ldots f \ldots) \in \text{Dictionary}$

Stoplisted cross-product of non-one-to-one entries (SLCP). The items added to
OO in the cross-product process have a high potential for noise, since many non-one-to-one
dictionary entries contain function words that do not hold an equivalence relation to
corresponding words on the other side of the entry. (An example of this is ‘to’ in English
infinitives.) The noise might degrade the dictionary’s performance; in order to lessen this
effect, the following dictionary preprocessing was applied:

•  All  non-one-to-one  entries  from  both  dictionaries  were  collected;  the  total  was  214,947 
entries. 

¹⁰ The creation of this dictionary is described in [LOC00]. Thanks to Gina Levow, Doug Oard, and Clara
Cabezas for allowing its use.

¹¹ The creation of this dictionary is described in [WS00]. Thanks to Gina Levow for offering it for use here.




•  Any entries containing words not present in the corpus were removed. The remaining
set contained 6,060 entries.

•  A stoplist filter was applied to the set. The English stoplist contained 238 common
closed-class words; the Chinese stoplist contained 188 function words¹². After the
filter was applied, any entries where the English or Chinese side was empty were
removed. The result was 5,555 entries.

•  The non-one-to-one entries were expanded into multiple entries by taking the
cross-product of English words with Chinese words. All elements in the cross-product were
included; the result was 8,285 entries. 2,842 of these entries were not present in the
OO dictionary.

•  These entries were merged with the OO dictionary. The new dictionary totaled 12,105
one-to-one entries.

Let σ(x) be a boolean predicate taking the value ‘true’ if x is present on its language’s
stoplist. The $t_{SLCP}$ function is:

$t_{SLCP}(e, f) \Leftrightarrow (\neg\sigma(e) \wedge \neg\sigma(f) \wedge (\ldots e \ldots, \ldots f \ldots) \in \text{Dictionary}) \vee (e, f) \in \text{Dictionary}$
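
The three dictionary-derived functions can be illustrated with a short sketch. It is not the preprocessing code used here: entries are assumed to be pairs of word tuples, the corpus-vocabulary filter is omitted, the toy entries use romanized placeholders for the Chinese side, and the names build_oo and build_cp are hypothetical.

```python
from itertools import product


def build_oo(entries):
    """Keep only entries with one word on each side (the OO dictionary)."""
    return {(e[0], f[0]) for e, f in entries if len(e) == 1 and len(f) == 1}


def build_cp(entries, stop_e=frozenset(), stop_f=frozenset()):
    """OO entries plus the cross-product of each non-one-to-one entry.

    With empty stoplists this yields the CP dictionary; with stoplists it
    yields the SLCP dictionary (stoplisted words are dropped before
    expansion, and an entry with an empty side contributes nothing).
    """
    pairs = build_oo(entries)
    for e_words, f_words in entries:
        if len(e_words) == 1 and len(f_words) == 1:
            continue  # already covered by the OO dictionary
        kept_e = [w for w in e_words if w not in stop_e]
        kept_f = [w for w in f_words if w not in stop_f]
        pairs.update(product(kept_e, kept_f))
    return pairs


# Toy entries (hypothetical).
entries = [(("dog",), ("gou",)),
           (("to", "run"), ("pao",)),
           (("the", "law"), ("fa", "lv"))]
oo = build_oo(entries)
cp = build_cp(entries)
slcp = build_cp(entries, stop_e={"to", "the"})


def t_slcp(e, f):
    """Boolean t function: membership in the SLCP dictionary."""
    return (e, f) in slcp


print(len(oo), len(cp), len(slcp))
```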

7.1.2  Method  A  translation  model 

Using Method A [Mel00], a symmetric translation model was induced on the Hong Kong
Laws (see Table 2) training corpus. One change was made to the training method described
in [Mel00] for the model: rather than use competitive linking, I used the maximum weighted
bipartite matching algorithm implemented in the Library of Efficient Data Types (LEDA)
[MN99]. This provides a closer approximation to the maximum likelihood estimation for
Method A. A Method A distribution contains the probability of generating a token in one
language with a “null-word” in the other; a null-word is simply an empty element not
observable in the string. After the model was trained, all entries involving a null-word were
removed. The translation model used contained 3,767 non-zero entries of the form Pr(e, f)
where e is an English word and f a Chinese word. The function is:

$t_A(e, f) = \Pr(e, f)$


7.1.3  Results 

The evaluation method described in Section 5 was applied to all four t functions on the test
corpus of 191 segment pairs. For this experiment, k = n; that is, there is no added noise.
The results are shown in Table 3.

Figure 4 shows precision-recall plots for the entire competitive linking process on each of
these t functions as well as the maximum weighted bipartite matching performance for each.
Early iterations correspond to the left side of the plots (note that recall can never decrease,
so the algorithm will only move rightward across the plot). The degradation in precision as
competitive linking proceeds is due to erroneous links made between text samples. This
is expected; each successive competitive linking step links the next-highest scored pair, so
the confidence of the decision made at each iteration diminishes. The increase in recall, of
course, is due to correct matches accumulated.

¹² Thanks to Dan Melamed for the use of these stoplists.
                              competitive linking     MWB matching
t function                    precision  recall       precision  recall
OO dictionary                 0.5236     0.5236       0.6607     0.5812
CP dictionary                 0.4660     0.4660       0.6176     0.5497
SLCP dictionary               0.5550     0.5550       0.6845     0.6021
Method A translation model    0.7016     0.7016       0.8372     0.7539

Table 3: The translation model outperforms the dictionary, and maximum weighted
bipartite matching outperforms competitive linking.


The translation model clearly and substantially outperforms all three dictionaries. Note
that the SLCP dictionary achieved higher recall and precision than the other two (OO and
CP). In addition, maximum weighted bipartite matching offers much better results than
competitive linking in all cases.


Figure 4: Precision and recall of four t functions through the competitive linking process
and under maximum weighted bipartite matching.


7.1.4  Further  Exploration 

The following discussion seeks to investigate why the translation model was more successful
than the dictionaries. Simultaneously, the difference in performance of the two matching
algorithms (competitive linking and maximum weighted bipartite matching) is considered.

Preprocessing effects. Degradation of the dictionary due to preprocessing, while possible,
is unlikely. The most obvious uses of the dictionary were all applied. Both CP and
SLCP significantly increased the size of the dictionary used (by 64% in the first case, by 31%
in the second). Yet neither offered benefits that pushed the performance of this t function
to the level of the statistical translation model’s performance. It is of course possible that
more clever, less obvious ways to exploit a dictionary to this end exist.

Non-boolean scores and ties. One possible reason for degradation under competitive
linking may be traceable to the boolean nature of a t function derived from a dictionary.
Because the entries are unweighted, the values that the r score may take on are limited
to rational numbers with denominators ≤ q, where q is equal to the maximum length of
an English text sample plus the maximum length of a Chinese text sample. If the set of
possible r values is finite, then many tied scores are to be expected. When multiple pairings
are tied, the competitive linking algorithm chooses from them at random until no more
linkings may be made. Therefore, if a correct pairing (e, f) is tied with many other pairings
involving e or f, the probability of correctly linking e with f shrinks. The number of unique
scores observed in the set of all 36,481 pairwise r-scorings is shown in Table 4. (Note that
precision and recall are given at the completion of competitive linking; because there was
no noise and each pair was linked to exactly one other pair, the total number of linkings
was 191. As a result, final precision and final recall were equivalent.)


dictionary   number of unique r values   CL precision, recall
OO           779                         0.5236
SLCP         808                         0.5550
CP           939                         0.4660

Table 4: Tied scores probably do not account for degraded performance on competitive
linking.


These data are inconsistent with the hypothesis that low score variability resulting in
bad tie-breaking is a factor in the performance of competitive linking. Competitive linking
performance does not correlate with the number of distinct scores. It is left to conclude that
competitive linking is an approximation that simply fails to capture the nuances discovered by
maximum weighted bipartite matching.

The next question to be asked is whether the weights in the t function derived from the
statistical translation model play a role in its success in this framework.

Method A translation model without weights. The 3,767 non-zero entries in the
translation model distribution were stripped of their probability values, giving an
unweighted translation lexicon¹³. This was used as a boolean t function on the English-Chinese
experimental task:

$t_{A,\text{unweighted}}(e, f) = 1$ if $\Pr(e, f) > 0$; else 0

Results are shown in Table 5; precision-recall plots for the weighted and unweighted t
functions are shown in Figure 5.

¹³ It is worth noting that applying a threshold to the translation model entries before removing weights,
so that low-probability pairs are not included in the set of translationally equivalent word pairs, was not
attempted, though it might offer some benefit.
                                competitive linking     MWB matching
t function                      precision  recall       precision  recall
Method A translation model      0.7016     0.7016       0.8372     0.7539
unweighted translation model    0.8115     0.8115       0.9075     0.8220

Table 5: Performance of the translation model with and without weights.


Figure 5: Precision and recall of the translation model with and without weights through
the competitive linking process and under maximum weighted bipartite matching.
This t function is without a doubt noisy. Many of the entries learned from the corpus
are the result of frequent collocation rather than translational equivalence¹⁴. Yet removing
weights improves performance under maximum weighted bipartite matching and under
competitive linking. It is unclear why this might be the case, though it might be conjectured
that the weights in the translation model result in “undue caution”: when computing the
resemblance score, a word pair’s t value might be quite low, though the word pair is in fact
translationally equivalent.

These experiments suggest that the translation model detected, in the training corpus,
a significant number of term equivalences that were not present in the dictionary. In fact,
the intersection of listed entries (weights aside) of the OO dictionary with the translation
model is only 485 entries. The intersection with the CP dictionary contained 535 entries,
and the intersection with the SLCP dictionary contained 533 entries.

It is therefore likely that the translation model’s performance is attributable in part
to domain specificity. Of the words in the corpus, 41% of English types (63% of tokens)
and 72% of Chinese types (87% of tokens) were listed in at least one dictionary entry (in
the SLCP dictionary). Yet the translation model is in stark contrast: of the words in
the corpus, 55% of English types, accounting for a mere 12% of tokens, were listed in at
least one translation model probability. The same is true of only 61% of Chinese types,
accounting for only 11% of tokens. This shows that the translation model, unsurprisingly,
induced translational equivalence relations between highly informative words not found in
the dictionary.

Merge of dictionary and unweighted translation model. Finally, because the
intersection of the dictionary and the translation model entries was so small, and because the
weights in the translation model seem not to play a major role in performance, I merged
the best-performing electronic dictionary (SLCP) with the unweighted translation model to
create a t function that might have the advantages of each. Detection results on the test
corpus are shown in Table 6; precision-recall plots are shown in Figure 6.


                                       competitive linking     MWB matching
t function                             precision  recall       precision  recall
SLCP dictionary                        0.5550     0.5550       0.6845     0.6021
unweighted translation model           0.8115     0.8115       0.9075     0.8220
union of translation model and SLCP    0.8220     0.8220       0.9048     0.8953

Table 6: Performance of the best-performing preprocessed dictionary merged with the
unweighted translation model.

These results are exactly what one might hope to find. The high precision of the
unweighted translation model t function is achieved, and recall reaches almost 90%,
attributable to additional coverage from the dictionary.
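As a reading aid for Table 6, the sketch below computes precision and recall from link counts, assuming the usual definitions (correct links over links made, and correct links over the k true pairs). The counts 157, 171, and 189 are not reported in the text; they are back-computed from the rounded table values with k = 191 and should be treated as an inference.

```python
def precision_recall(correct_links, links_made, k):
    """Precision = correct / links made; recall = correct / true pairs (k)."""
    precision = correct_links / links_made if links_made else 0.0
    recall = correct_links / k
    return precision, recall

K = 191  # translationally equivalent pairs in the English-Chinese test corpus

# Competitive linking run to completion with n = k links every sample,
# so links_made == k and precision equals recall (inferred: 157 correct).
cl_precision, cl_recall = precision_recall(157, K, K)

# Under maximum weighted bipartite matching fewer links appear to be made
# (inferred: 171 correct out of 189), so precision exceeds recall.
mwbm_precision, mwbm_recall = precision_recall(171, 189, K)
```

This also makes explicit why the two competitive linking columns are identical in every row of the table.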

7.2  Human  vs.  Machine:  English-Spanish 

A similar experiment was carried out on English-Spanish data to verify that these findings
might be generalizable. Table 7 shows data about the corpus used.

Footnote: For this task, however, this information may be useful nonetheless, since collocated,
translationally inequivalent words may tend to collocate frequently.

[Plot omitted: precision against recall, both axes from 0 to 1. Series: SLCP dictionary,
unweighted translation model, and combined, each under competitive linking and under MWBM.]

Figure 6: Precision and recall of the translation model, dictionary, and their union through
the competitive linking process and under maximum weighted bipartite matching.


            size         English   English   Spanish   Spanish   mean English    mean Spanish
            (segments)   tokens    types     tokens    types     segment size    segment size
training    4,695        141,125   9,872     158,094   11,484    29.43           32.97
test        200          6,637     1,751     7,549     1,795     33.19           37.75

Table 7: English-Spanish parallel corpus. The original text from which this was taken is a
portion of United Nations proceedings [Graff94]. The alignment was done by Clara Cabezas
using MXTERMINATOR [RR97]. The English text was tokenized using a script included
in the Egypt distribution [WS99], written by Dan Melamed and Yaser Al-Onaizan. The
Spanish text was tokenized using a program written by Nizar Habash and Bonnie Dorr,
with kind permission of the authors.
7.2.1 Electronic bilingual dictionary

A single bilingual lexicon (see footnote) consisting of 58,413 one-to-one entries from English
to Spanish was used. Of those entries, 6,053 contained words found in the corpora. The
t_Dict function is simply:

$t_{\mathrm{Dict}}(e, f) = \big( (e, f) \in \mathrm{Dictionary} \big)$


7.2.2  Method  A  translation  model 

As in the Chinese/English experiment, a symmetric translation model (Model A [MelOO])
was induced on the training corpus, using maximum weighted bipartite matching instead of
competitive linking. (See discussion in 7.1.2.) Entries involving a null-word were removed,
and the resulting model contained 10,227 non-zero entries. The t_A function is defined as:

$t_{A}(e, f) = \Pr(e, f)$

As in the English-Chinese experiment, an unweighted (boolean) version was also computed,
such that $t_{A\mathrm{unweighted}}(e, f) = 1$ if $\Pr(e, f) > 0$, and 0 otherwise.

7.2.3  Combined  t 

The intersection of the dictionary with the translation model was only 1,449 entries. The
two were therefore merged to create a new t function such that:

$t_{\mathrm{combined}}(e, f) = t_{\mathrm{Dict}}(e, f) \lor t_{A\mathrm{unweighted}}(e, f)$

This combined function held true for 14,831 pairs relevant to the corpus.
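The three boolean t functions of this experiment can be sketched in Python as follows. The helper names and the toy entries are hypothetical; only the entry counts in the final check come from the text (6,053 corpus-relevant dictionary entries, 10,227 model entries, 1,449 shared).

```python
def make_dict_t(dictionary_pairs):
    """t_Dict: true when (e, f) is a dictionary entry."""
    pairs = frozenset(dictionary_pairs)
    return lambda e, f: (e, f) in pairs

def make_unweighted_t(model_probs):
    """t_Aunweighted: true when the model assigns Pr(e, f) > 0."""
    return lambda e, f: model_probs.get((e, f), 0.0) > 0

def make_combined_t(t_dict, t_unweighted):
    """t_combined: logical OR (union) of the two boolean functions."""
    return lambda e, f: t_dict(e, f) or t_unweighted(e, f)

def union_size(size_a, size_b, size_intersection):
    """Inclusion-exclusion count of entries in the merged function."""
    return size_a + size_b - size_intersection

# Toy entries (hypothetical, for illustration only).
t_dict = make_dict_t({("house", "casa")})
t_model = make_unweighted_t({("house", "casa"): 0.4, ("law", "ley"): 0.2})
t_combined = make_combined_t(t_dict, t_model)

# Consistency check against the counts reported above.
merged_entries = union_size(6053, 10227, 1449)
```

The inclusion-exclusion count reproduces the 14,831 pairs stated above, which suggests the merge is a plain set union.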

7.2.4  Results 

The evaluation method described in Section 5 was applied to all four t functions on the test
corpus of 200 segment pairs. For this experiment, k = n; that is, there is no added noise.
Table 8 compares precision and recall results from the four functions; precision-recall plots
are shown in Figure 7.


                                             competitive linking      MWB matching
t function                                   precision    recall      precision    recall
Dictionary                                   0.9050       0.9050      0.9282       0.9050
Method A translation model                   0.7500       0.7500      0.8971       0.7850
unweighted translation model                 0.8200       0.8200      0.9600       0.8400
union of dictionary and translation model    0.9500       0.9500      0.9442       0.9300

Table 8: Some dictionaries perform well as t functions; the union of a dictionary and a
translation model achieves even better results.


These results are highly similar to the English-Chinese results, except that the dictionary
outperforms the translation model. The most plausible reason for this is simply that the
English-Spanish dictionary used was either well-suited to the genre of this test corpus or
had more extensive coverage than the dictionaries used in the English-Chinese experiments.

Footnote (bilingual lexicon, Section 7.2.1): Thanks to Gina Levow for providing this resource.
The dictionary consists of unigram English terms from the Brown Corpus [FK82] and their
equivalents returned by the LOGOS machine translation system [Logos]. Some heuristics were
employed in the process to force certain inflections on nouns, adjectives, and verbs.

Figure 7: Precision and recall of four t functions through the competitive linking process
and under maximum weighted bipartite matching.

These experiments on English-Chinese and English-Spanish corpora show that translational
equivalence is detectable in text, and that statistical models offer a means to carry
out that detection. This is a useful result, as parallel text (required for translation model
training) is a less expensive resource than bilingual dictionaries, which, though in some
cases may offer even higher levels of performance on the detection task, are variable in their
effectiveness.

7.3  Length  Filter 

Using three of the English-Chinese t functions from the experiment in Section 7.1, I applied
the length filter from Section 6 at various values of p. This technique reduces the amount
of time required for scoring text sample pairs (computing the resemblance score r) by
eliminating pairs where the length of the Chinese segment is outside some confidence interval
for the length of the English segment, given a linear regression model. The linear regression
model is learned from the same training corpus as the translation model.
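One way such a length filter could be realized is sketched below. This is an illustration under stated assumptions, not the definition from Section 6: it assumes an ordinary least-squares fit, normally distributed residuals, and a two-sided interval that excludes probability mass p, so that larger p gives a narrower band and a smaller search space. The function names and training lengths are hypothetical.

```python
from statistics import NormalDist, mean

def fit_length_model(english_lengths, chinese_lengths):
    """Least-squares fit: chinese_len ~ intercept + slope * english_len.

    Returns (intercept, slope, residual_sd).
    """
    mx, my = mean(english_lengths), mean(chinese_lengths)
    sxx = sum((x - mx) ** 2 for x in english_lengths)
    sxy = sum((x - mx) * (y - my)
              for x, y in zip(english_lengths, chinese_lengths))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x)
                 for x, y in zip(english_lengths, chinese_lengths)]
    residual_sd = (sum(r * r for r in residuals) / (len(residuals) - 2)) ** 0.5
    return intercept, slope, residual_sd

def passes_length_filter(model, english_len, chinese_len, p):
    """Keep a candidate pair if the Chinese length lies within the interval.

    Assumption: the interval is predicted length +/- z * residual_sd, where
    z is the normal quantile leaving p / 2 in each tail.
    """
    intercept, slope, residual_sd = model
    z = NormalDist().inv_cdf(1 - p / 2)
    predicted = intercept + slope * english_len
    return abs(chinese_len - predicted) <= z * residual_sd

# Hypothetical training lengths (in tokens), for illustration only.
model = fit_length_model([10, 20, 30, 40, 50], [11, 23, 31, 42, 54])
```

Because the test is a constant-time comparison per candidate pair, it can be applied before the far more expensive computation of the resemblance score r.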

Table 9 shows precision and recall after application of the length filter at various values of
p for the SLCP dictionary, and Figure 8 shows precision and recall throughout competitive
linking and under maximum weighted bipartite matching. Table 10 and Figure 9 show the
same for the Method A translation model (unweighted). Table 11 and Figure 10 show the
same for the union of the SLCP dictionary and the unweighted translation model. The
values of p chosen are small, since the benefit (search space size reduction) levels out as p
increases (see Figure 2).
It is important to note that, because the filters eliminate some correct pairs, recall is
bounded by the number of correct pairs that pass the filter. For example, for the p = 0.01
length filter, 188 correct pairs pass the filter, so recall is limited to 188/191 = 0.9843.
Similarly, the limits for p = 0.05, p = 0.1, and p = 0.15 are 0.9634, 0.9529, and 0.9476,
respectively. For this reason, competitive linking results are shown for these experiments at
that critical iteration step; for each filter, the step denoted as “max” is the last one for
which precision may potentially be equal to 1 (i.e., 188 for p = 0.01, etc.). Beyond that
iteration, precision cannot increase. It is recognized that a user of this technique cannot know
the number of correct pairs eliminated by the length filter, though under competitive linking,
the maximum precision score is observable.
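The recall ceilings quoted above follow directly from the number of correct pairs that survive each filter. In the sketch below, the counts for p = 0.01 and p = 0.1 (188 and 182) are stated in the text; the counts 184 and 181 are back-computed from the rounded limits with k = 191 and are therefore an inference.

```python
K = 191  # translationally equivalent pairs in the test corpus

# Correct pairs passing each length filter (184 and 181 are inferred).
correct_pairs_passing = {0.01: 188, 0.05: 184, 0.1: 182, 0.15: 181}

def recall_ceiling(passing, k=K):
    """Upper bound on recall once the filter has removed some correct pairs."""
    return passing / k

ceilings = {p: round(recall_ceiling(count), 4)
            for p, count in correct_pairs_passing.items()}
```

The same counts give the “max” iteration for each filter: precision can equal 1 only while the number of links made does not exceed the number of correct pairs that passed.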


                 competitive linking                                MWB matching
                 max                      completion
length filter    precision    recall      precision    recall      precision    recall
none             0.5550       0.5550      0.5550       0.5550      0.6845       0.6021
p = 0.01         0.5638       0.5550      0.5550       0.5550      0.6864       0.6073
p = 0.05         0.5815       0.5602      0.5654       0.5654      0.6923       0.6126
p = 0.1          0.5934       0.5654      0.5707       0.5707      0.6568       0.5812
p = 0.15         0.5856       0.5550      0.5550       0.5550      0.6627       0.5864

Table 9: Effects of the length filter on translation detection using a bilingual dictionary
(SLCP).
Figure 8: Precision and recall of the SLCP dictionary are hardly affected by length filtering.

Under both matching algorithms, the filter did not affect performance of the SLCP
dictionary greatly. In many cases performance improved, and in cases where it was degraded,
the decrease was less than 0.03 for both precision and recall.


                 competitive linking                                MWB matching
                 max                      completion
length filter    precision    recall      precision    recall      precision    recall
none             0.7016       0.7016      0.7016       0.7016      0.8372       0.7539
p = 0.01         0.8032       0.7906      0.7906       0.7906      0.9017       0.8168
p = 0.05         0.8043       0.7749      0.7749       0.7749      0.8830       0.7906
p = 0.1          0.8242       0.7853      0.7853       0.7853      0.8772       0.7853
p = 0.15         0.8232       0.7801      0.7801       0.7801      0.8764       0.7801

Table 10: Effects of the length filter on translation detection using an unweighted statistical
translation model.
Figure 9: Precision and recall of the unweighted translation model are hardly affected by
length filtering.

Performance of the unweighted translation model was improved under both matching
algorithms for all filters. The results thus far suggest that the length filter not only offers a
computational savings, but it also guides the matching process by eliminating pairs unlikely
to be linked.

The merged t function’s performance did not experience the same benefits through
application of a length filter. Some slight increase in precision was to be had (at very slight
and unsurprising loss in recall) under competitive linking.

                 competitive linking                                MWB matching
                 max                      completion
length filter    precision    recall      precision    recall      precision    recall
none             0.8220       0.8220      0.8220       0.8220      0.9048       0.8953
p = 0.01         0.8298       0.8168      0.8168       0.8168      0.8783       0.8691
p = 0.05         0.8370       0.8063      0.8370       0.8370      0.8670       0.8534
p = 0.1          0.8407       0.8010      0.8115       0.8115      0.8457       0.8325
p = 0.15         0.8398       0.7958      0.8063       0.8063      0.8670       0.8534

Table 11: Effects of the length filter on translation detection using an unweighted statistical
translation model merged with the SLCP dictionary.

Figure 10: Precision and recall of the merged t function are hardly affected by length
filtering.

It is clear that reducing the search space (and therefore run time) using a length filter
does not greatly diminish performance when the text samples are sentence-sized, particularly
under competitive linking. In fact, under some conditions it improves performance. It
is concluded that this is a useful technique allowing for faster completion of the detection
task.


7.4  Robustness  to  Candidate  Noise 

It is unrealistic to suppose that the task of selecting translation pairs from a set of candidates
will typically involve sets of text samples in two languages where every sample is known to
have exactly one translationally equivalent partner in the opposite set. It is to be expected
that there will be some noise, e.g., text samples which are not part of translation pairs.
These make the task more difficult by increasing the number of candidates quadratically
and possibly affecting both precision and recall.

In order to test the robustness of this method to such noise, I randomly selected 573
English text samples and 573 Chinese text samples from the same Hong Kong Laws corpus
that the test and training corpus came from. These samples were of comparable length
(34.64 average sample length for English, 36.73 for Chinese) and genre to the corpus in
Table 2. The English samples were taken from a different region of the corpus than the
Chinese samples to avoid choosing samples that might be translationally equivalent.

Incrementally, the noise samples were added to the test corpus of k = 191 pairs (see
footnote), and the resulting candidate pairs were scored. Maximum weighted bipartite matching
is not a useful algorithm for this task because it seeks to link as many text sample pairs as
possible to maximize the sum of the scores. Doing this will inevitably hurt precision; if
k = 191, then generating more than k links will affect precision for the worse.
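By contrast, competitive linking can be halted at any point. The sketch below shows the greedy procedure as it is commonly described: candidate pairs are visited in descending order of score, and a pair is linked only if neither sample is already linked. The function name, the `max_links` parameter, and the toy scores are hypothetical, and tie-breaking here simply follows input order.

```python
def competitive_linking(scores, max_links=None):
    """Greedily link text samples in descending order of resemblance score.

    scores: dict mapping (english_id, foreign_id) -> resemblance score r.
    max_links: optional stopping point (for example, an estimate of k).
    Returns the linked (english_id, foreign_id) pairs, best first.
    """
    links, used_english, used_foreign = [], set(), set()
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for (e, f), _score in ranked:
        if max_links is not None and len(links) >= max_links:
            break
        if e in used_english or f in used_foreign:
            continue  # each sample may take part in at most one link
        links.append((e, f))
        used_english.add(e)
        used_foreign.add(f)
    return links

# Toy scores (hypothetical): e3/f3 stands in for a pair of noise samples.
toy_scores = {
    ("e1", "f1"): 0.9, ("e1", "f2"): 0.8, ("e2", "f1"): 0.7,
    ("e2", "f2"): 0.1, ("e3", "f3"): 0.05,
}
all_links = competitive_linking(toy_scores)
top_link = competitive_linking(toy_scores, max_links=1)
```

Stopping early keeps only the highest-scoring links, which is what makes the procedure usable when some candidates have no true partner.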

In order to lessen the computational cost, a length filter with p = 0.1 was applied. Table
12 shows the reduction in the search space through the application of this filter. This filter
was selected because it offered, overall, the greatest improvement in performance under
competitive linking (see results of the length filter experiments in Section 7.3).


n (n - k)      number of candidates    number of candidates after filter
287 (96)       82,369                  37,573
382 (191)      145,924                 66,173
573 (382)      328,329                 144,950
764 (573)      583,696                 246,431

Table 12: Reduction in candidates via length filter.
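The candidate counts in Table 12 are simply n squared, since every English sample is paired with every Chinese sample. A short sketch makes the quadratic growth and the filter's effect explicit; the variable names are hypothetical and the figures are those of the table.

```python
def candidate_count(n):
    """Every one of n English samples is paired with every one of n Chinese samples."""
    return n * n

# n -> candidates remaining after the p = 0.1 length filter (from Table 12).
after_filter = {287: 37573, 382: 66173, 573: 144950, 764: 246431}

for n, kept in after_filter.items():
    total = candidate_count(n)
    print(f"n = {n}: {total} candidates, {kept / total:.1%} kept by the filter")
```

In each row the filter removes more than half of the candidates before any resemblance score is computed.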


Three  t  functions  were  used:  the  SLCP  dictionary,  the  Method  A  translation  model 
(unweighted),  and  the  union  of  the  two.  Results  with  the  dictionary  are  shown  in  Table  13, 
and  Figures  11  and  12  show  precision  and  recall  (respectively)  throughout  the  competitive 
linking  process.  Precision  and  recall  are  separated  and  each  is  plotted  against  competitive 
linking  iteration  to  highlight  the  importance  of  stopping  the  competitive  linking  process  at 
the  appropriate  time.  Analogous  results  with  the  translation  model  are  shown  in  Table  14 
and  Figures  13-14,  and  results  for  the  union  are  shown  in  Table  15  and  Figures  15-16. 

Note that after 191 links are made, it is unlikely that further iterations (in the competitive
linking process) will improve precision or recall, as only 191 of the original possible links
are correct. For this reason, competitive linking results are shown only for 191 iterations.
Iteration 182 is shown because it is known (in the experiment) to be the point at which
precision cannot increase, since it is known that only 182 correct pairs passed the p = 0.1
length filter. Therefore, recall cannot pass 0.9529.

Footnote (k and n): Following the discussion in Section 5, k refers to the number of
translationally equivalent pairs, and n refers to the total number of text samples in each
language. The number of noise samples added is n - k.


                       iteration 182            iteration 191 (k)
noise level (n - k)    precision    recall      precision    recall
0% (0)                 0.5934       0.5654      0.5707       0.5707
33% (96)               0.5824       0.5550      0.5550       0.5550
50% (191)              0.5879       0.5602      0.5602       0.5602
67% (382)              0.5824       0.5550      0.5550       0.5550
75% (573)              0.5879       0.5602      0.5602       0.5602

Table 13: Effects of noise on the SLCP dictionary t function.


Figure 11: Precision of the SLCP dictionary is hardly affected by noise. The vertical lines
mark the points where competitive linking ceases for successively noisier scenarios; e.g.,
linking stops at iteration 287 when there are a total of 287 (191 good plus 96 noise) pairs
to be linked.

These results show that the SLCP t function is robust to noise, even when the amount of
noise is three times the amount of correct pairs present. Note that linking past 182 results
only in decreased precision. Of course, in practical applications, the number of correct pairs
that passed the filter will be unknown, as will be k, the number of translationally equivalent
pairs present. Determining when to stop the linking process is not addressed here, though
it follows from the competitive linking results that choosing a threshold will suffice. Such a
threshold could be chosen empirically by finding the range of r scores for known translation
pairs. Alternately, the percentage of noise in the candidates could be estimated and
only the top k links taken.
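The first stopping rule suggested above, a threshold drawn from the scores of known translation pairs, might look like the following sketch. How the threshold should actually be derived is left open in the text; taking the score met by a chosen fraction of known pairs is one hypothetical option, and all names and numbers below are illustrative.

```python
def choose_threshold(known_pair_scores, fraction_kept=0.95):
    """Return the r score that `fraction_kept` of known pairs meet or exceed."""
    ranked = sorted(known_pair_scores, reverse=True)
    index = int(fraction_kept * len(ranked)) - 1
    index = max(0, min(len(ranked) - 1, index))  # clamp to a valid position
    return ranked[index]

def links_above_threshold(ranked_links, threshold):
    """Keep only links whose score reaches the threshold.

    ranked_links: list of ((english_id, foreign_id), score), best first.
    """
    return [(pair, score) for pair, score in ranked_links if score >= threshold]

# Hypothetical r scores observed for known translation pairs.
known_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
median_threshold = choose_threshold(known_scores, fraction_kept=0.5)
kept = links_above_threshold(
    [(("e1", "f1"), 0.9), (("e2", "f2"), 0.3)], median_threshold)
```

A lower `fraction_kept` trades recall for precision, mirroring the trade-off visible along the competitive linking curves.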
Figure 12: Recall of the SLCP dictionary is hardly affected by noise. The vertical lines
mark the points where competitive linking ceases for successively noisier scenarios.


                       iteration 182            iteration 191 (k)
noise level (n - k)    precision    recall      precision    recall
0% (0)                 0.8242       0.7853      0.7853       0.7853
33% (96)               0.5659       0.5393      0.5550       0.5550
50% (191)              0.5550       0.5288      0.5340       0.5340
67% (382)              0.5550       0.5288      0.5288       0.5288
75% (573)              0.5495       0.5236      0.5236       0.5236

Table 14: Effects of noise on the translation model (unweighted) t function.
Figure 13: Precision of the unweighted translation model is heavily affected by noise. The
vertical lines mark the points where competitive linking ceases for successively noisier
scenarios.

Figure 14: Recall of the unweighted translation model is heavily affected by noise. The
vertical lines mark the points where competitive linking ceases for successively noisier scenarios.

The translation model’s performance is greatly affected by the addition of noise. Both
precision and recall are markedly diminished early in the competitive linking process and
remain so throughout. However, performance does not diminish greatly as the amount of
noise increases.

One possibility for the decrease in precision and recall of this t function is the “noise”
in the function itself. Recall that this function was formed by removing weights from
statistical translation entries. While many entries in the translation model are translationally
equivalent pairs, some are not. Presumably higher weights will correlate with translational
equivalence. Therefore, by removing the weights (i.e., making t a boolean function),
incorrect word pairings will be given the same attention as their correct counterparts. It
is possible that this method is robust to either noise in the candidates or noise in the t
function, but not both. To test this, I carried out the same experiment with the weighted
translation model as t function, including the baseline case with no noise. The results are
not given here because the hypothesis proved incorrect: not only did the weighted
translation model fail to achieve the same performance as the unweighted version (as in the
original experiment with no length filter, see Section 7.1.3), but it degraded sharply as
candidate noise increased.


                       iteration 182            iteration 191 (k)
noise level (n - k)    precision    recall      precision    recall
0% (0)                 0.8407       0.8010      0.8115       0.8115
33% (96)               0.4670       0.4450      0.4555       0.4555
50% (191)              0.4121       0.3927      0.3927       0.3927
67% (382)              0.3352       0.3194      0.3246       0.3246
75% (573)              0.2747       0.2618      0.2618       0.2618

Table 15: Effects of noise on the translation model (unweighted) t function merged with
the SLCP dictionary t function.


Performance of the merged t function degrades much more quickly than the translation
model alone, as noise increases. In addition, precision and recall never reach the same level
as either the SLCP dictionary or the translation model alone.

Three key conclusions may be drawn from this experiment. First, some t functions
may be chosen that are robust to noise in the candidate data (e.g., SLCP English-Chinese
dictionary); such relations perform almost as well in noisy conditions as in noiseless ones.
Second, some such functions do not diminish in performance as the level of noise increases
(e.g., SLCP dictionary and the unweighted Method A translation model). Finally, the
translation model (unweighted) performed almost as well as the SLCP dictionary in noisy
conditions. While the results from this section suggest that the best-performing t functions
are not necessarily resilient enough to be useful in noisy conditions, there is promise that
noise in the candidates need not result in performance loss. Finding functions that perform
well and hold up to noise is left for future research, though the English-Spanish results in
Section 7.2 suggest that some bilingual dictionaries may fit the bill.

7.5  Cognates:  A  Bonus 

In Section 4, a method for deriving a score of cognate-ness between words (HSCR) was
described. It was suggested that using that score as a t function was worth consideration.
Here an experiment on English-French (a cognate-rich pair) is described and results
reported. The evaluation task here is the same as that in the other experiments; no noise was
added and no length filter was applied.

Figure 15: Precision of the merged t function is heavily affected by noise. The vertical lines
mark the points where competitive linking ceases for successively noisier scenarios.

Figure 16: Recall of the merged t function is heavily affected by noise. The vertical lines
mark the points where competitive linking ceases for successively noisier scenarios.

Facts about the corpus used for training the translation model (details below) and the
corpus used for testing are shown in Table 16.


            size         English   English   French    French    mean English    mean French
            (segments)   tokens    types     tokens    types     segment size    segment size
training    5,963        153,199   7,993     168,361   10,529    25.53           28.06
test        200          3,788     917       3,858     920       18.21           18.55

Table 16: English-French Bible parallel corpus. The English and French text was tokenized
using scripts included in the Egypt distribution [WS99], written by Dan Melamed and Yaser
Al-Onaizan.


7.5.1  Electronic  bilingual  dictionary 

A t function was derived from an electronic English-French dictionary, using the same
methods described in Section 7.1.1. To begin, the French-to-English dictionary (see footnote)
contained 34,804 entries. The entries were reversed to be English-to-French for this purpose,
and each entry was tokenized using tokenization scripts for these languages included in the
Egypt distribution [WS99].

One-to-one  entries  (OO) 

1.  All  entries  involving  more  than  one  English  word  or  more  than  one  French  word  were 
removed.  30,783  entries  remained. 

2. Any entries including words not found in the candidate pairs were removed (this step
was carried out for efficiency reasons; it could not have affected performance in any
way). The final OO dictionary contained 4,462 entries.

Stoplisted  cross-product  of  non-one-to-one  entries  (SLCP) 

•  All  non-one-to-one  entries  from  the  dictionary  were  collected;  the  total  was  4,021 
entries. 

•  Any  entries  containing  words  not  present  in  the  corpus  were  removed.  The  remaining 
set  contained  837  entries. 

• A stoplist filter was applied to the set. The English stoplist contained 238 common
closed-class words; the French stoplist contained 311 words (see footnote). After the filter
was applied, any entries where the English or French side was empty were removed. The
result was 673 entries.

• The non-one-to-one entries were expanded into multiple entries by taking the
cross-product of English words with French words. All elements in the cross-product were
included; the result was 1,210 entries. 111 of these entries were not present in the OO
dictionary.

• These entries were merged with the OO dictionary. The new dictionary totaled 5,561
one-to-one entries.

Footnote (French-to-English dictionary): This dictionary was derived by Gina Levow from an
English-to-French one freely available at http://www.freedict.com.

Footnote (stoplists): Thanks to Dan Melamed for the use of these stoplists.

The SLCP dictionary, in preliminary tests, outperformed the OO dictionary, so it was
the only one used in this experiment.
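The dictionary preprocessing steps above can be sketched as a single pass over the entries. This is one reading of the procedure, not the original implementation: it assumes an entry is dropped if any of its words is absent from the corpus, and that one-to-one entries are not stoplisted. The function name and the toy entries are hypothetical.

```python
from itertools import product

def build_slcp(entries, english_vocab, foreign_vocab, english_stop, foreign_stop):
    """Build a stoplisted cross-product (SLCP) set of one-to-one word pairs.

    entries: iterable of (english_words, foreign_words) tuples of tokens.
    *_vocab: sets of word types present in the corpus.
    *_stop:  stoplists of common closed-class words.
    """
    merged = set()
    for english_words, foreign_words in entries:
        # Drop entries containing any word that is absent from the corpus.
        if not (set(english_words) <= english_vocab
                and set(foreign_words) <= foreign_vocab):
            continue
        if len(english_words) == 1 and len(foreign_words) == 1:
            merged.add((english_words[0], foreign_words[0]))  # OO entry
            continue
        # Stoplist filter; an emptied side gives an empty cross-product.
        kept_english = [w for w in english_words if w not in english_stop]
        kept_foreign = [w for w in foreign_words if w not in foreign_stop]
        merged.update(product(kept_english, kept_foreign))
    return merged

# Toy entries (hypothetical, for illustration only).
toy_entries = [
    (("dog",), ("chien",)),
    (("cat",), ("chat",)),            # removed: words not in the corpus
    (("to", "eat"), ("manger",)),     # "to" is stoplisted
    (("hot", "dog"), ("hot", "dog")),
    (("of", "the"), ("de",)),         # English side emptied by the stoplist
]
slcp = build_slcp(
    toy_entries,
    english_vocab={"dog", "to", "eat", "hot", "of", "the"},
    foreign_vocab={"chien", "manger", "hot", "dog", "de"},
    english_stop={"to", "of", "the"},
    foreign_stop=set(),
)
```

Collecting the result in a set performs the final merge with the one-to-one entries and removes any cross-product pairs that duplicate them.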

7.5.2  Automatically  learned  t  functions 

Another set of t functions came from the set of non-zero entries from a Method A translation
model trained on 5,963 aligned English-French verses of the Bible. These were selected at
regular intervals (see footnote) throughout the Bible from the unannotated section of the
Blinker corpus [Mel98]. Table 16 shows facts about the training corpus for the translation
model. There were 9,593 non-zero entries in the translation model. As before, one t function
was defined as $t_{A}(e, f) = \Pr(e, f)$. The same 9,593 entries from the translation model
were also used without weights (i.e., $t_{A\mathrm{unweighted}}(e, f) = 1$ if and only if
$\Pr(e, f)$ is non-zero). As before, a merged t function, the union of the unweighted
translation model with the SLCP dictionary, was created; it contained 14,446 non-zero entries.

Using the technique from Section 4, an m function was induced using the translation
model. This, in turn, was used to compute HSCR cognate-ness scores for all word pairs
in the cross-product of the set of English types with the set of French types. An ad hoc
threshold of 0.2 was selected; this left a set of 340,213 scored word pairs in the set of
candidates for which t > 0.

Though this set was highly noisy (i.e., many pairs were not cognates), preliminary
comparisons with higher thresholds showed that 0.2 was, for this m function and data
set, reasonable (see footnote). The t function was $t(e, f) = \mathrm{HSCR}(e, f)$ if
$\mathrm{HSCR}(e, f) > 0.2$ and 0 otherwise, and the unweighted version was $t(e, f) = 1$ if
$\mathrm{HSCR}(e, f) > 0.2$ and 0 otherwise. Table 17 shows some sample word pairs from the
test corpus and their HSCR scores. Because biblical text is rich in names, the m function
adeptly captures character-to-character mappings. Note the many name pairs from the test
corpus with high scores.
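The thresholded cognate t functions can be sketched as follows. The HSCR computation itself (Section 4) is not reproduced; the sketch assumes the scores are already available in a lookup table. The first two scores are taken from Table 17, while the ("king", "roi") pair, its score, and the function names are hypothetical.

```python
THRESHOLD = 0.2  # the ad hoc HSCR threshold used in this experiment

def weighted_hscr_t(hscr_scores, e, f, threshold=THRESHOLD):
    """t(e, f) = HSCR(e, f) if HSCR(e, f) > threshold, else 0."""
    score = hscr_scores.get((e, f), 0.0)
    return score if score > threshold else 0.0

def unweighted_hscr_t(hscr_scores, e, f, threshold=THRESHOLD):
    """Boolean version: 1 if HSCR(e, f) > threshold, else 0."""
    return 1 if weighted_hscr_t(hscr_scores, e, f, threshold) > 0 else 0

hscr_scores = {
    ("Jobab", "Jobab"): 0.5022,    # from Table 17
    ("vapor", "evapore"): 0.2430,  # from Table 17
    ("king", "roi"): 0.05,         # hypothetical non-cognate pair
}
name_weight = weighted_hscr_t(hscr_scores, "Jobab", "Jobab")
non_cognate = unweighted_hscr_t(hscr_scores, "king", "roi")
```

Raising the threshold shrinks the set of pairs for which t is non-zero, trading noise in the cognate set against coverage.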

7.5.3  Results 

Table  18  shows,  under  both  competitive  linking  and  maximum  weighted  bipartite  matching, 
precision  and  recall  of  these  t  functions  on  the  detection  task.  Precision-recall  plots  for  these 
t  functions  are  shown  in  Figures  17  and  18. 

For  this  data  set,  the  cognate  function  is  the  best-performing  t  function,  though  the 
translation  model  (without  weights)  nearly  reaches  its  level  of  precision  under  maximum 
weighted  bipartite  matching.  The  SLCP  dictionary’s  performance  was  strikingly  low;  this 
may  be  due  in  part  to  the  absence  of  accents  in  the  dictionary  (accents  were  present  in  the 
corpus). 

Footnote (verse selection, Section 7.5.2): For this task I used the Whittle tool by Mike Jahr
included in the Egypt distribution [WS99].

Footnote (HSCR threshold, Section 7.5.2): Note that a “good” HSCR score for a word pair is
entirely dependent on the m function, which in turn depends entirely on the training corpus.
This threshold cannot be said to be generally applicable. Choosing a good threshold is an
empirical question left unexplored here.
rank       HSCR      English       French
1          0.5022    Jobab         Jobab
2          0.4814    Jabal         Jabal
3          0.4809    Jarib         Jarib
4          0.4790    Joram         Joram
1,001      0.3551    Jorah         Jaera
1,002      0.3550    Menahem       Menahem
1,003      0.3550    Mahath        Maath
1,004      0.3550    Hadid         Hadid
8,001      0.2968    Ahuzzam       Achuzzam
8,002      0.2968    Dishon        Disent
8,003      0.2968    Jorai         Joanan
8,004      0.2968    Jaddai        Joanan
64,001     0.2430    vapor         evapore
64,002     0.2430    Jabesh        Jaroach
64,003     0.2430    Caraway       Conania
64,004     0.2430    Branches      Barabbas
340,210    0.2000    Dedication    variation
340,211    0.2000    Separate      egarants
340,212    0.2000    separate      egarants
340,213    0.2000    Pagiel        Publie

Table 17: Examples of English-French cognate-ness (HSCR) scores. These examples were
chosen arbitrarily to show a range of scores and noisiness.


                                      competitive linking      MWB matching
t function                            precision    recall      precision    recall
SLCP dictionary                       0.2400       0.2400      0.3393       0.2850
Method A translation model            0.8800       0.8800      0.9677       0.9000
unweighted translation model          0.9100       0.9100      0.9840       0.9200
union of SLCP and unweighted t. m.    0.8750       0.8750      0.9538       0.9300
HSCR (0.2 threshold)                  0.9450       0.9450      0.9949       0.9800
unweighted HSCR (0.2 threshold)       0.9400       0.9400      0.9695       0.9550

Table 18: Performance of various t functions on the English-French test corpus.
Figure 17: Performance of the SLCP dictionary, translation model (with and without
weights) and the two combined.

Figure 18: Performance of the HSCR (0.2 threshold) t function with and without weights.

The unweighted translation model and the unweighted HSCR t functions were merged
to create a new t function such that $t(e, f) = 1$ if $\Pr(e, f) > 0$ or
$\mathrm{HSCR}(e, f) > 0.2$, and 0 otherwise. The result contained 348,724 entries. This t
function offered precision and recall of 0.99 under both competitive linking and maximum
weighted bipartite matching. The precision-recall plots showing this in comparison to both of
the components merged are shown in Figure 19.
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Figure  19:  Performance  of  the  HSCR  (0.2  threshold)  t  function  combined  with  the  un¬ 
weighted  translation  model. 


Taking the union of the SLCP dictionary, the unweighted translation model, and the
unweighted cognate classifier (a total of 352,739 non-zero entries) yielded precision of 1 with
recall of 1 under both competitive linking and maximum weighted bipartite matching. (This
result is not shown in any of the precision-recall plots.) In this particular instance, combining
three sources of information in the simplest way (union) yielded perfect classification.
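
The union of boolean t functions can be sketched in a few lines. This is a minimal illustration, assuming each boolean t function is stored as the set of word pairs it maps to 1; the helper names `union_t` and `t` and all entries below are hypothetical toy values, not data from the experiments.

```python
def union_t(*entry_sets):
    """Merge boolean t functions, each given as a set of non-zero (e, f) pairs."""
    merged = set()
    for entries in entry_sets:
        merged |= entries
    return merged


def t(e, f, entries):
    """Boolean t function: 1 if (e, f) is a non-zero entry, otherwise 0."""
    return 1 if (e, f) in entries else 0


# Toy entries, for illustration only (not taken from the experiments).
dictionary = {("house", "maison"), ("water", "eau")}
translation_model = {("house", "maison"), ("king", "roi")}
cognates = {("nation", "nation"), ("king", "roi")}

merged = union_t(dictionary, translation_model, cognates)
print(len(merged))  # 4: entries shared by several sources are counted once
```

Because shared entries are counted once, the size of a union is at most the sum of the sizes of its parts.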

It can be concluded that even noisy cognate classifiers are useful for translation detection.

7.6  Performance  at  the  Document  Level:  Comparison  with  STRAND 

One of the potential applications for this technique discussed in Section 1.1 was using
content-based scores of translational equivalence to boost performance of the STRAND
system [Res99], which uses structural information only to locate translation pairs at the
document level. Here I offer preliminary results that show the capability of the resemblance
scoring technique to classify document pairs by translational equivalence in comparison with
STRAND’s capability.

Resnik [Res99] ran an experiment with two human judges in which STRAND
classifications of candidate document pairs were compared to human ratings of translational
equivalence. In all, there were 233 English-French document pairs* for which the human
judges agreed (see Table 19 for data on these pairs). Of these, 84 were ruled as translation
pairs, and the other 149 were ruled as inequivalent. The STRAND system correctly
identified 66 of the 84 good pairs as translations (i.e., recall = 0.7857), with precision of 0.9429.
The STRAND result is used here as a benchmark.
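
The benchmark figures can be checked with a little arithmetic. In the sketch below, the 66 correct pairs and 84 good pairs come from the text; the figure of 70 pairs labeled as translations by STRAND is not stated above. It is inferred here as the count consistent with the reported precision, and should be read as an assumption.

```python
def precision_recall(true_positives, predicted_positives, actual_positives):
    """Return (precision, recall) computed from raw counts."""
    precision = true_positives / predicted_positives
    recall = true_positives / actual_positives
    return precision, recall


# 66 of the 84 human-approved pairs were identified by STRAND (from the text).
# 70 predicted positives is an inferred value, not a reported one.
p, r = precision_recall(true_positives=66, predicted_positives=70,
                        actual_positives=84)
print(f"precision = {p:.4f}, recall = {r:.4f}")
```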


        size      English    English    French     French     mean English    mean French
        (pairs)   tokens     types      tokens     types      segment size    segment size
good    84        56,388     21,378     64,730     28,492     671.29          770.60
bad     148       180,912    30,247     120,810    33,940     1222.38         816.28
all     232       237,300    34,470     185,540    38,318     1022.84         799.74

Table 19: English-French STRAND candidate pairs. Thanks to Philip Resnik for use of the
documents and details of the STRAND experiment. The English and French text was
tokenized using scripts included in the Egypt distribution [WS99], written by Dan Melamed and
Yaser Al-Onaizan. “Good” refers to candidate pairs agreed to be translationally equivalent
by the human judges; “bad” refers to the candidate pairs agreed not to be translationally
equivalent by the human judges.


The  base  t  functions  used  in  this  experiment  were: 

• An SLCP dictionary derived from the same original dictionary in Section 7.5, retaining
all relevant entries (total size was 7,637 entries).

• Method A translation model from Section 7.5 with weights.

• Method A translation model from Section 7.5 without weights.

• HSCR (cognate-ness) scores were computed for every word co-occurrence pair in the
set of candidate documents† using the m function from Section 7.5. As before, these
were thresholded at HSCR 0.2; this left 36,137 pairs for which t > 0. The t function
was defined as t(e, f) = HSCR(e, f) if HSCR(e, f) > 0.2 and 0 otherwise.

• A boolean t function derived from the HSCR: t(e, f) = 1 if HSCR(e, f) > 0.2 and 0
otherwise.
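
The last two t functions in this list differ only in whether the score is kept as a weight. A minimal sketch of that difference follows, assuming HSCR scores are available as a dictionary keyed by word pair. The function names are hypothetical; the first three scores are taken from Table 20 and the last one is a made-up low score.

```python
def weighted_cognate_t(hscr, threshold=0.2):
    """t(e, f) = HSCR(e, f) if HSCR(e, f) > threshold; zero entries are omitted."""
    return {pair: score for pair, score in hscr.items() if score > threshold}


def boolean_cognate_t(hscr, threshold=0.2):
    """t(e, f) = 1 if HSCR(e, f) > threshold; zero entries are omitted."""
    return {pair: 1 for pair, score in hscr.items() if score > threshold}


hscr = {
    ("CANADA", "CANADA"): 0.8315,      # Table 20
    ("VEHICLES", "VEHICULE"): 0.5228,  # Table 20
    ("Cherney", "Culture"): 0.2173,    # Table 20
    ("dog", "chien"): 0.05,            # made-up score below the threshold
}

weighted = weighted_cognate_t(hscr)
boolean = boolean_cognate_t(hscr)

# Pairs absent from the dictionary are treated as t(e, f) = 0.
print(weighted.get(("dog", "chien"), 0), boolean[("Cherney", "Culture")])
```

Both versions have the same non-zero entries; only the values differ.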

The  genre  of  the  translation  model  training  data  (biblical  verses)  is  in  sharp  contrast 
to  the  genre  of  the  candidate  documents  (World-Wide  Web  documents),  so  it  is  expected 
that  the  coverage  of  the  translation  model  t  function  will  be  limited  for  this  purpose.  For 
this  reason  cognates  are  expected  to  increase  the  performance  of  the  translation  model. 

Table 20 shows some sample word pairs from the cross-product of words in the English
and French documents and their HSCR scores. Note that English words appear in French
text and vice versa, and that Web text often contains errors. Capital letters tend to receive
high m scores because they are less common than lower-case letters, so many all-capital
words receive high HSCR scores even though they are not cognates.

* For this experiment, one pair was not used because it contained a great deal of non-linguistic material
(i.e., a compressed file), resulting in an excessively high number of “terms,” most of which were not words
but pieces of compressed data. This pair was ruled as “bad” by both judges and STRAND.

† A small error was made in this process. The minimum length of a French type was computed to be
5, but because of a small error in the code, only pairs involving French words of length 6 or greater were
actually used. At the time of printing, there was not time to correct this; it stands to reason that more
cognates would be found and used if the error were corrected. This error did not affect cognate scoring in
the experiments described in Section 7.5.


rank      HSCR      English         French
1         0.8315    CANADA          CANADA
2         0.7926    COMMAND         COMMAND
3         0.7871    NACIONAL        NACIONAL
4         0.7812    ANGLAIS         ANGLAIS
201       0.5238    MAISON          ACTION
202       0.5228    VEHICLES        VEHICULE
203       0.5226    ARGENTINA       ARGENTINES
204       0.5225    BORDEAUX        BORDEAUX
1,001     0.3595    Bettina         Bettina
1,002     0.3595    Corporations    Corporation
1,003     0.3595    Corportation    Corporation
1,004     0.3595    VERSION         EDITION
8,001     0.2549    BASICLY         AIDENT
8,002     0.2549    TRAFFIC         AIDENT
8,003     0.2549    Limited.        Limitee
8,004     0.2549    Walter          Walter
20,001    0.2174    information     orientation
20,002    0.2174    Panama          Paraphe
20,003    0.2174    Politicised     politiciens
20,004    0.2173    Cherney         Culture
36,134    0.2000    LABELLING       SOMMAIRE
36,135    0.2000    associate       soixante
36,136    0.2000    Choosing        Coordonner
36,137    0.2000    transition      association

Table 20: Examples of English-French cognate-ness (HSCR) scores. These examples were
chosen arbitrarily to show a range of scores and noisiness.


7.6.1  Structure  vs.  Content 

The  candidate  generation  component  of  STRAND  selects  candidate  document  pairs  rather 
than  candidate  documents.  This  means  that  it  is  unnecessary  to  score  the  cross-product 
of  English  documents  with  French  documents;  only  the  232  candidate  pairs  needed  to  be 
scored.  Competitive  linking  degenerates  to  simple  thresholding  by  resemblance  score  in  this 
scenario,  since  there  is  no  “competition”  among  documents  of  the  same  language. 

In  order  to  test  content-based  comparison  only,  structural  information  (i.e.,  HTML 
tags)  was  completely  removed  from  the  documents.  The  only  preprocessing  applied  was 
tokenization. 

Results  are  reported  as  precision-recall  plots.  Each  plotted  point  corresponds  to  a 
threshold.  The  highest  thresholds  offer  high  precision  and  low  recall,  and  as  the  threshold 
decreases,  precision  also  decreases  while  recall  increases.  The  points  in  the  upper  left 
corner  of  each  plot  correspond,  then,  to  the  highest  thresholds.  The  STRAND  benchmark 
is  marked  as  a  reference  point. 
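
The way such a plot is produced can be sketched as follows. This is a minimal illustration, assuming each candidate pair has a resemblance score and a human judgement, and that a pair is accepted when its score is at least the threshold (the choice of >= is an assumption). The scores below are made up.

```python
def precision_recall_points(scored_pairs):
    """Return (threshold, precision, recall) for each distinct score.

    scored_pairs -- list of (score, is_translation) tuples, one per candidate
                    pair; a pair is accepted if score >= threshold.
    """
    total_good = sum(1 for _, good in scored_pairs if good)
    points = []
    for threshold in sorted({s for s, _ in scored_pairs}, reverse=True):
        accepted = [good for s, good in scored_pairs if s >= threshold]
        true_positives = sum(accepted)
        points.append((threshold,
                       true_positives / len(accepted),
                       true_positives / total_good))
    return points


# Made-up scores and judgements, for illustration only.
pairs = [(0.9, True), (0.8, True), (0.6, False), (0.5, True), (0.2, False)]
points = precision_recall_points(pairs)
for threshold, precision, recall in points:
    print(f"{threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

Recall can only grow as the threshold is lowered. Precision tends to fall, although on small samples it can rise locally when a low-scoring good pair is admitted.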

The SLCP dictionary’s performance (see Figure 20) came very close to that of STRAND,
but did not surpass it. This is rather surprising, considering this dictionary’s dismal
performance on the Blinker Bible corpus in the experiment reported in Section 7.5. It is possible
that the absence of accents in the dictionary was less of a problem in this case, since the
documents themselves may lack accents.

Figure 20: Performance of the SLCP dictionary as compared to STRAND.

As before, the performance of the unweighted translation model was significantly greater
than that of the weighted one (see Figure 21). Neither, however, obtained the level of
performance of STRAND or the SLCP dictionary. This is not surprising, given the genre
difference of the training and test data.

Another  likely  effect  is  the  size  of  the  documents.  Recall  that  the  resemblance  score 
does  not  take  into  account  word  type  frequency  in  a  text  sample.  A  word  type  may  be 
present  once  or  one  hundred  times  in  a  document,  and  that  document’s  resemblance  score 
with  any  other  document  will  remain  the  same.  This  is  more  important  at  the  document 
level  than  the  sentence  level,  since  documents  are  more  likely  to  contain  many  duplicates 
of  word  types,  some  of  which  may  be  highly  informative.  Further  investigation  is  necessary 
to  determine  the  robustness  of  this  technique  to  candidate  text  sample  size. 
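
The loss of frequency information can be seen directly once a sample is reduced to its set of word types. The sketch below illustrates only that reduction, not the resemblance score itself, which is defined earlier in terms of a t function. The token lists are made up.

```python
from collections import Counter


def word_types(tokens):
    """Reduce a tokenized text sample to its set of word types."""
    return set(tokens)


short_doc = ["canada", "command", "version"]
long_doc = short_doc + ["canada"] * 99  # "canada" now occurs 100 times

# Any score computed from the type sets alone cannot tell these samples apart.
same_types = word_types(short_doc) == word_types(long_doc)

# A multiset view keeps the counts that the set view throws away.
counts = Counter(long_doc)
print(same_types, counts["canada"])
```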

As with the translation model, the cognate t function performed better without weights
than with weights (see Figure 22). The cognate classifier is not terribly useful on its own,
but given that only a small subset of word pairs will actually be cognates, it is expected to
serve mainly to supplement other t functions that are domain-limited.

It might be expected, for example, that the cognate classifier would complement the
translation model, adding word pairs that, though not in the Bible corpus, are
useful for Web documents. Such cognates might include proper names and technical terms.
I tested the performance of the union of the two, and it performed consistently worse than
the translation model alone. It is unclear why this would be the case; further investigation
is required to understand what is happening in this situation.

Figure 21: Performance of the translation model with and without weights.

Figure 22: Performance of a cognate classifier with and without weights, in comparison to
STRAND.


Finally, the SLCP t function was merged with the unweighted cognate classifier
(yielding 43,485 non-zero entries), and also with both the unweighted cognate classifier and
unweighted translation model (yielding 52,561 non-zero entries).

Figure 23: Performance of the SLCP dictionary compounded with the unweighted cognate
classifier, and with both the Method A translation model without weights and the cognate
classifier in comparison to STRAND.


As shown in Figure 23, this method outperforms STRAND when the t function is the
union of the SLCP dictionary with the cognate classifier (unweighted). There are slight
improvements beyond that when the translation model is added in as well‡. By adding in
cognates, the number of non-zero pairs for t was increased to 43,485, an increase of 522%.
Adding the translation model entries as well resulted in 52,561 pairs. Table 21 shows this
t function’s precision and recall at STRAND’s levels of recall and precision.


technique                        precision    recall
STRAND [Res99] (structure)       0.9429       0.7857
resemblance scoring (content)    0.9565       0.7857
                                 0.9429       0.8571

Table 21: Content-based translation detection wins out over structure-based translation
detection.


8  Future  Work 

A number of extensions to this research may be suggested; some are described briefly here.

‡ The translation model added to the SLCP dictionary by itself did not result in any improvement.

8.1 Theoretical modifications


One of the key shortcomings of the resemblance-scoring method described here is its failure
to take into account multiple occurrences of a word type in a text sample. This information
could be important, for example, when dealing with samples on the same topic in which a
word type is used frequently, sometimes more than once in a sample. The number of times
that word type is seen could be an indicator of its importance in a sample; that information
in turn should be matched with samples in the other language.

The technique presented here views text samples as sets of words. I did not explore
the effectiveness of shingles of size greater than 1, since dictionaries and translation models
will not generally be useful for the straightforward derivation of t functions over pairs of
such shingles. It might be worthwhile to examine ways to exploit such (mostly) one-to-one
resources to develop bigram or trigram t functions.

Along the same lines, I did not use a generative probability model for pairs of
word-sets, mainly because of the problem of exponential decay in probability of an item as length
increases. Eisner [Eis01] suggested an approach to this which might be considered in further
work. By constructing a model $M$ for translationally equivalent pairs and a model $\overline{M}$ for
other pairs, the probability that a pair $(E, F)$ was generated by $M$ is:

\[
\Pr(M \mid E, F) = \frac{\Pr(E, F \mid M) \cdot \Pr(M)}{\Pr(E, F \mid M) \cdot \Pr(M) + \Pr(E, F \mid \overline{M}) \cdot (1 - \Pr(M))}
\]
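
One reason this ratio is attractive is that it can be evaluated from log-likelihoods, so the very small probabilities of long samples need not be represented directly. The following is a minimal sketch of that computation under stated assumptions: the function name, the overflow cutoff, and the example values are hypothetical, and nothing here is taken from Eisner's proposal beyond the formula above.

```python
import math


def posterior_translation(log_lik_m, log_lik_not_m, prior_m):
    """Pr(M | E, F) by Bayes' rule, computed from log-likelihoods.

    log_lik_m     -- log Pr(E, F | M)
    log_lik_not_m -- log Pr(E, F | not M)
    prior_m       -- Pr(M), strictly between 0 and 1
    """
    a = log_lik_m + math.log(prior_m)
    b = log_lik_not_m + math.log(1.0 - prior_m)
    # exp(a) / (exp(a) + exp(b)) equals 1 / (1 + exp(b - a)), which only
    # needs the difference of the two log terms.
    diff = b - a
    if diff > 700:  # math.exp would overflow; the posterior is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


# Both likelihoods are far too small to store as ordinary floats
# (exp(-1200) underflows to 0.0), yet the posterior is well defined.
print(posterior_translation(-1200.0, -1210.0, prior_m=0.5))
```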


8.2  Enhancements  and  Further  Experimentation 

The noisy scenario in Section 7.4 is one likely to be encountered in many applications.
Frequently the amount of noise present (or, equivalently, the number of translationally
equivalent pairs present) in the data to be processed will be unknown. A generic means
of estimating the amount of noise, or of choosing a value for the r threshold, would be useful
in such scenarios.

It might be fruitful to explore ways to deal with polysemy, such as word-sense
disambiguation as a preprocessing step.

I did not carry out extensive testing of the robustness of this technique to text sample
size. This could be done relatively easily using an aligned parallel corpus by segmenting it
at multiple levels of granularity and evaluating the precision and recall at different sample
sizes and different levels of variability in sample size.
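
One simple way to obtain several levels of granularity is to merge consecutive aligned segments into larger aligned samples. This is a minimal sketch, assuming the corpus is a list of (English tokens, French tokens) pairs in document order; the `regroup` helper and the toy data are hypothetical.

```python
def regroup(aligned_segments, group_size):
    """Merge consecutive aligned segment pairs into larger aligned samples."""
    samples = []
    for start in range(0, len(aligned_segments), group_size):
        chunk = aligned_segments[start:start + group_size]
        english = [token for e_tokens, _ in chunk for token in e_tokens]
        french = [token for _, f_tokens in chunk for token in f_tokens]
        samples.append((english, french))
    return samples


# Toy aligned corpus: five one-word segment pairs.
corpus = [(["a"], ["x"]), (["b"], ["y"]), (["c"], ["z"]),
          (["d"], ["w"]), (["e"], ["v"])]

for size in (1, 2, 5):
    print(size, len(regroup(corpus, size)))
```

Each regrouped corpus could then be evaluated separately, giving one precision and recall measurement per sample size.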

The containment score c was not used at all in this technique. This score estimates the
extent to which one sample is “contained” in the other. It is possible that this score might
be exploited in some way.

One direction that might lead to increased performance is to consider an iterative
framework in which the classifier is used to select the unlabeled pairs it is most confident are
translations, and then these pairs are added to the training corpus. More robust empirical
classifiers (translation models and/or cognate classifiers) might then be learned from the
enhanced corpus. Similar techniques have proven successful in work by Yarowsky [Yar95]
and Blum and Mitchell [BM98].
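
The iterative framework can be sketched as a generic loop. This is only an illustration of the idea under stated assumptions, not the procedure of [Yar95] or [BM98]: the `train` and `confidence` functions below are toy stand-ins for a translation model and a resemblance-style score, and all names and data are hypothetical.

```python
def bootstrap(labeled, unlabeled, train, confidence, rounds=3, batch=2):
    """Iteratively grow the training set with the most confident pairs.

    labeled    -- list of pairs assumed to be translations
    unlabeled  -- list of candidate pairs of unknown status
    train      -- callable: list of pairs -> model
    confidence -- callable: (model, pair) -> float, higher is more confident
    """
    labeled, unlabeled = list(labeled), list(unlabeled)
    model = train(labeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        unlabeled.sort(key=lambda pair: confidence(model, pair), reverse=True)
        chosen, unlabeled = unlabeled[:batch], unlabeled[batch:]
        labeled.extend(chosen)
        model = train(labeled)  # relearn from the enlarged corpus
    return model, labeled


def train(pairs):
    # Toy "translation model": every word pair co-occurring in a training pair.
    return {(e, f) for E, F in pairs for e in E for f in F}


def confidence(model, pair):
    # Toy score: fraction of English words linked to some French word.
    E, F = pair
    linked = sum(1 for e in E if any((e, f) in model for f in F))
    return linked / len(E)


seed = [(("the", "house"), ("la", "maison"))]
candidates = [(("good", "morning"), ("au", "revoir")),
              (("the", "red", "house"), ("la", "maison", "rouge"))]

model, grown = bootstrap(seed, candidates, train, confidence, rounds=1, batch=1)
print(("red", "rouge") in model, len(grown))
```

In this toy run the second candidate is selected because two of its three English words are already covered by the seed model, and retraining then adds new word pairs such as ("red", "rouge").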

Other than the STRAND evaluation, this research involved no reliance on human
judgement. Although the aligned texts used may be considered reliable, some pairs “wrongly”
matched might be considered close in meaning by human judges. Further evaluation
involving bilingual speakers would more strongly support the positive findings presented here.

It goes without saying that better t functions are to be obtained, perhaps empirically like
the translation models. A key feature in such functions is their ability to hold up to noisy
conditions. Interestingly, my findings suggest that neither the size of the t function (i.e.,
number of non-zero entries) nor weights (i.e., a non-boolean function) offer performance
advantages.

Most importantly, better means of combining multiple sources of word translational
equivalence information need to be considered. In many cases, combining two
high-performing boolean t functions (through union) did not result in a significantly better t function,
even though the intersection was small. It is unclear why this is the case, but clearly there
can be benefits of combining, for example, a dictionary with cognates.

In particular, an interesting direction would be to combine the content-based
equivalence information offered by this technique with the structure-based equivalence information
used by systems like STRAND. It is likely that using cross-lingual resemblance scoring in
tandem with structural comparison would increase the capabilities of parallel text mining
applications.

9  Conclusions 

This discussion has presented a general algorithm which can reliably classify whether two
pieces of text are translations of each other. To my knowledge, no such technique has been
developed based on text content alone. This technique has shown success on three language
pairs: English-Chinese, English-Spanish, and English-French.

Because many scenarios are likely to involve large search spaces for translational pairs,
I have offered a filter to limit the search without affecting performance. I have shown
that, depending on the resources used to define word-to-word translational equivalence,
this approach can be robust to noise. Another useful finding is that, in this framework,
automatically-learned translation models (learned from parallel text) and cognate-scoring
functions (learned from translation models or dictionaries) can in some instances be used
either on their own or as supplements to a priori resources (i.e., electronic bilingual
dictionaries) to achieve excellent performance.

Finally, I have shown that a content-based classifier of translational equivalence can
compare favorably to one that uses only structural information. It is expected that the
technique presented here will prove useful in supplementing systems for parallel corpus
construction and other multilingual tasks.